Abstract

We present a high-level parallel calculus for nested sequences,
NSC, offered as a possible theoretical "core" of an entire
class of collection-oriented parallel languages. NSC is based
on while-loops as opposed to general recursion. A formal,
machine independent definition of the parallel time complexity
and the work complexity of programs in NSC is given.
Our main results are: (1) We give a translation method for
a particular form of recursion, called map-recursion, into
NSC, that preserves the time complexity and adds an arbitrarily
small overhead to the work complexity, and (2) We
give a compilation method for NSC into a very simple vector
parallel machine, which preserves the time complexity
and again adds an arbitrarily small overhead to the work
complexity.

1  Introduction 

There are many advantages to programming in a high-level
language. However, while sequential algorithms are most
of the time designed and evaluated in reasonably high-level
terms, the situation with parallel algorithms is, by necessity
so far, more complicated. The issue is intimately connected
with the existing efforts to bridge the gap between
the theoretical design of parallel algorithms and practical
programming on massively parallel computers.

In the case of data parallelism, the work of Blelloch [Ble90,
Ble93] and Blelloch and Sabot [BS90] has made substantial
progress on this issue. For example, if we manage to represent
an algorithm in a high-level language such as NESL
with a certain work and time (a.k.a. element or step) complexity,
and if the representation satisfies certain restrictions,
then we are guaranteed an implementation of the same algorithm
with the same asymptotic time and work complexity
in terms of a low-level parallel vector model, which in turn
admits efficient implementations on various architectures,
for example the CM2. The present paper proposes a
different treatment of similar goals.

*The authors were partially supported by an NSF grant.


We start with a somewhat abstract high-level language
which represents and manipulates mostly nested sequences
(lists), and so we called it NSC, for nested sequence calculus
(section 3). We regard NSC as a possible theoretical "core"
of an entire class of collection-oriented parallel languages. In
keeping with the tenets of data parallelism [HS86], NSC's
only parallel operation is map (apply-to-all). We give a precise
high-level definition of parallel complexity (in the work
and time framework [Jaj92]) for NSC programs.

Blelloch [Ble90, Ble93] gives convincing evidence that
nested maps on nested sequences (what he calls nested parallelism)
can enhance the expressiveness of a data parallel
language. But these high-level features are quite removed
from concrete parallel architectures or even the parallel vector
model and need to be compiled away. Unnesting the
nested parallelism is at the center of the compilation technique
of [Ble90, BS90, Ble93]. However, in a language with
general recursion, this technique is guaranteed to preserve
the asymptotic parallel complexity only for programs that
satisfy a certain semantic condition called containment.

NSC is based on while-loops rather than general recursion.
This will surely impose some limitations, although not
that many: our first main result consists of showing that a
large and practically relevant class of programs, called
map-recursive, can be translated into NSC while asymptotically
preserving the time complexity and adding an arbitrarily
small overhead to the work complexity (theorem 4.2). It
even turns out that some recursive programs which are not
contained in the sense of [Ble90] are in fact map-recursive.
The major benefit, however, is that we can compile NSC
without the need for an unbounded stack of vectors, as general
recursion would require. Avoiding the stack is a good
idea because SIMD architectures associate a relatively small
memory with each processor. A program that generates
many entries in its vector stack will run out of memory even
if the vectors are very short and hence much of the total
amount of memory of the machine remains unused. We
believe that our compilation technique can lead to better
memory management. Of course, this needs to be tested in
practice.

Following Blelloch, we define a simple parallel vector
model in order to describe abstractly the class of target architectures
for our compilation method (section 2). Our
BVRAM (Bounded Vector Random Access Machine) differs
from the VRAM [Ble90] primarily in that it has a finite number
of vector registers. This emphasizes the absence of a run-time
vector stack. Of course, the number of registers needed
depends on the source program being compiled. Another
important difference is that we need less powerful communication
primitives. The BVRAM has no general permutation
instruction, and its communication primitives can be implemented
on a butterfly network with $n \log n$ nodes in $O(\log n)$
steps. The BVRAM can be efficiently implemented on SIMD
architectures such as CM2 and MasPar MP-1, and it has the
potential of efficient implementation on MIMD machines as
well, such as CM5, Paragon XP/S, KSR1, etc.

Our second main result is a technique that compiles any
NSC program into a BVRAM program, again while asymptotically
preserving the time complexity and adding an arbitrarily
small overhead to the work complexity (theorem 7.1).
Along the way we also give a simulation that allows us to
understand NSC complexity in terms of the complexity of
computations on a certain flavor of PRAM (proposition 3.2),
we show how to implement the BVRAM instructions on a
butterfly network (proposition 2.1), we connect NSC with
some standard parallel complexity classes (proposition 6.2),
we show how to represent in NSC Valiant's $O(\log n \log \log n)$
time sorting algorithm [Val75, Jaj92] (section 5), and, as
part of the compilation process, we define an intermediate
abstract language, the sequence algebra, which has the
same power as BVRAMs but may prove more flexible in
connecting to the designs of the future (section 7).

NSC borrows heavily from our experience with languages
for collection types [BTS91, BBW92] and it is worthwhile
mentioning that many of its operations make as much sense
for sets and bags (multisets) as for lists (sequences). It matters
to us, though it may not be so relevant to the goals of
this paper, that NSC is based on a clear, statically checkable
type system, that we understand the meaning of NSC programs
independently of their parallel execution, and that we
know how to reason about them, for example how to validate
source to source optimizations. We have in mind applications
to databases, and this naturally brings up important
complexity issues. In a previous paper we have shown a
tight connection between a related data parallel language
for sets and the class NC [SBT94]. This in turn has led us
to the more practical questions addressed here.

2 The Target: Bounded Vector Random Access Machines

To compile the higher level programming language described
in section 3, only a very simple vector parallel model is
needed. The Bounded Vector Random Access Machine,
BVRAM, is a restriction of the VRAM introduced in [Ble90],
in that it only admits a fixed number of registers, and has
only particular communication primitives, not a general permutation.
The BVRAM can be efficiently implemented on
a wide range of parallel architectures, because: (1) only a
simple, rather particular form of communication is needed to
implement every instruction of the BVRAM, and (2) memory
management at each processor is simplified by having
only a bounded number of vector registers, as opposed to
an unbounded number in the VRAM model.

A BVRAM, M, consists of a fixed number of vector registers
$V_1, \ldots, V_r$. Each $V_i$ can hold a sequence (a vector) of
natural numbers of arbitrary, but finite, length. To keep the
model simple, we do not include scalar registers: a number is
represented by a sequence of length 1. A program for M is a
sequence of labeled instructions, from the following instruction
set. For some of the instructions below, it is convenient
to view a pair of registers $V_i, V_j$ in which the length of the
first equals the sum of the numbers in the second as a nested
sequence. E.g., intuitively we view $[x_0, x_1, z_0, z_1, z_2], [2, 0, 3]$
as standing for the nested sequence $[[x_0, x_1], [], [z_0, z_1, z_2]]$.

• Move instruction: $V_i \leftarrow V_j$.

• Arithmetic operations, of the form $V_i \leftarrow V_j \; op \; V_k$.
Here $op$ is an arithmetic operation from a set $\Sigma$. $V_j$
and $V_k$ must be arrays of the same length; the
operation $op$ is applied simultaneously to all elements
of $V_j$ and $V_k$ at the same positions, and the
result is stored in $V_i$. In general we leave $\Sigma$ unspecified,
but mention here that for theorems 4.2 and 7.1
$\Sigma$ has to contain, among other operations, right-shift
and $\log_2$, while for proposition 6.2 we require that all
operations in $\Sigma$ be in NC. Monus, written $m \dot{-} n$,
is defined as $m - n$ when $m \geq n$ and 0 otherwise.

• Sequence oriented operations: $V_i \leftarrow []$ loads the empty
sequence in $V_i$. $V_i \leftarrow [n]$, where $n \in \mathbb{N}$, loads the
singleton sequence $[n]$ into $V_i$. $V_i \leftarrow V_j @ V_k$ appends
$V_j$ and $V_k$ and stores the result in $V_i$. $V_i \leftarrow [length(V_j)]$
computes the length of $V_j$. $V_i \leftarrow enumerate(V_j)$ loads
the sequence $[0, 1, \ldots, n - 1]$ into $V_i$, where $n$ is the
length of $V_j$.

• Bounded monotone routing $V_i \leftarrow bm\_route(V_j, V_k, V_l)$;
here $V_k$ and $V_l$ must have the same length. The effect
is that each element in $V_l$ is replicated a number
of times equal to the corresponding number in $V_k$.
In addition, it is required that the result matches in
length the sequence $V_j$ (i.e. initially $V_j, V_k$ represent
a nested sequence). E.g., if $V_j = [x_0, x_1, z_0, z_1, z_2]$,
$V_k = [2, 0, 3]$ and $V_l = [a, b, c]$, then the instruction
$V_i \leftarrow bm\_route(V_j, V_k, V_l)$ stores $[a, a, c, c, c]$ into $V_i$.

• Segmented bounded monotone routing $V_i \leftarrow sbm\_route(V_j, V_k, V_l, V_m)$.
Here, $V_j, V_k$ and $V_l, V_m$ must be nested
sequences, and $length(V_k) = length(V_m)$. Then, the
subsequences of $V_l$ are replicated according to the numbers
in $V_k$ and the result is stored in $V_i$. E.g., suppose
$V_j = [x_0, x_1, z_0, z_1, z_2]$, $V_k = [2, 0, 3]$,
$V_l = [a_0, a_1, b_0, b_1, b_2, c_0, c_1, c_2]$ and $V_m = [2, 3, 3]$. Then, after
$V_i \leftarrow sbm\_route(V_j, V_k, V_l, V_m)$, $V_i$ will hold the value
$[a_0, a_1, a_0, a_1, c_0, c_1, c_2, c_0, c_1, c_2, c_0, c_1, c_2]$. In the particular
case when $V_k, V_m$ have length one, this computes the
cartesian product of $V_j$ and $V_l$. Note that the length
of the output is $\leq length(V_j) * length(V_l)$ and that
bm_route can be expressed with two sbm_route instructions.

• Selection $V_i \leftarrow \sigma(V_j)$. The effect is that the nonzero
values of $V_j$ are packed and moved into $V_i$. E.g. if
$V_j = [3, 0, 1, 0, 0, 4]$, then $[3, 1, 4]$ is stored in $V_i$.

• The unconditional jump goto $l$ and the conditional
jump if $empty?(V_i)$ then goto $l$, where $l$ is a label of
some instruction. The conditional jump is taken iff $V_i$
currently holds the empty sequence.

• halt, stops the program.
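
The routing and selection instructions above can be illustrated with a small sequential sketch in Python. It only mirrors the input/output behaviour described in the list; it is not a parallel implementation, and the function names `bm_route`, `sbm_route`, `select` and the helper `segments` are chosen here for illustration. Strings are used in place of natural numbers so that the examples from the text can be reproduced literally.

```python
def segments(data, lengths):
    """View the flat pair (data, lengths) as a nested sequence."""
    if sum(lengths) != len(data):
        raise ValueError("lengths must sum to len(data)")
    out, pos = [], 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out


def bm_route(vj, vk, vl):
    """Replicate vl[i] exactly vk[i] times; the result must be as long as vj."""
    if len(vk) != len(vl):
        raise ValueError("vk and vl must have the same length")
    if sum(vk) != len(vj):
        raise ValueError("(vj, vk) must represent a nested sequence")
    out = []
    for count, value in zip(vk, vl):
        out.extend([value] * count)
    return out


def sbm_route(vj, vk, vl, vm):
    """Replicate the i-th segment of (vl, vm) exactly vk[i] times."""
    if len(vk) != len(vm):
        raise ValueError("vk and vm must have the same length")
    if sum(vk) != len(vj):
        raise ValueError("(vj, vk) must represent a nested sequence")
    out = []
    for count, seg in zip(vk, segments(vl, vm)):
        out.extend(seg * count)
    return out


def select(vj):
    """Pack the nonzero values of vj."""
    return [v for v in vj if v != 0]
```

With the registers of the examples above, `bm_route` yields `[a, a, c, c, c]` and `sbm_route` yields the segment `[a0, a1]` twice followed by `[c0, c1, c2]` three times.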

We associate with each BVRAM program P two numbers:
$r_i, r_o$, the number of input and output registers. P
expects $r_i$ inputs in the registers $V_1, \ldots, V_{r_i}$, and returns
$r_o$ outputs, in $V_1, \ldots, V_{r_o}$. For some input, the result of P
might be undefined, if P enters an infinite loop, or if an error
occurs. For a terminating execution of P, we define the
parallel time complexity T to be the total number of instructions
executed by P, i.e. each instruction is considered
to have parallel time complexity 1. Similarly, we define the
work complexity W as the sum of the work complexities
of all instructions executed by P, where the work complexity
of an instruction is defined to be the sum of the lengths
of its input and output registers.
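
As a minimal sketch of this accounting, the following Python fragment tallies T and W over a recorded execution. The trace representation (a list of `(inputs, output)` pairs, one per executed instruction) and the names `instruction_work` and `run_cost` are assumptions made here for illustration only.

```python
def instruction_work(inputs, output):
    """Work of one instruction: total length of its input and output registers."""
    return sum(len(v) for v in inputs) + len(output)


def run_cost(trace):
    """Return (T, W) for a terminating execution.

    trace is a list of (inputs, output) pairs, one per executed instruction,
    where inputs is a list of register contents and output is a register content.
    """
    time = len(trace)  # every instruction counts as one parallel step
    work = sum(instruction_work(ins, out) for ins, out in trace)
    return time, work
```

For instance, executing an append of `[1, 2]` and `[3]` followed by a length computation on the result gives T = 2 and W = (2 + 1 + 3) + (3 + 1) = 10.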

As opposed to VRAMs [Ble90], there is no general permutation
instruction on a BVRAM (but one can be computed
with an increase in the time or work complexity). This may
lead to more efficient implementations on fixed-connection
networks, as exemplified by the following proposition.

Proposition 2.1 Any BVRAM instruction of work complexity
W can be implemented in time $O(\log n)$ on a butterfly
network with $n \log n$ nodes, where $n = O(W)$, using only
oblivious routing algorithms.

Proof. (Sketch) The arithmetic operations involve
no communication at all, thus can be implemented in $O(1)$
steps. The append operation $V_i \leftarrow V_j @ V_k$ only requires a
monotone routing of the values in $V_k$. This can be done
in $O(\log n)$ steps, using the greedy routing algorithm, see
[Lei92], p. 534. bm_route is implemented by a monotone
routing, and takes $O(\log n)$ steps with the greedy algorithm.
For sbm_route, suppose first that $length(V_k) = length(V_m) = 1$,
i.e. sbm_route computes the cartesian product of $V_j$ and
$V_l$. Also, suppose that the lengths of $V_l$ and $V_j$ are powers
of 2, namely $2^p$ and $2^q$ respectively. Take $n = 2^{p+q}$;
then we have $2^p$ packets residing in the first $2^p$ rows of a
butterfly with $2^{p+q}$ rows, and we have to route the packet
with address $00 \ldots 0 u_{p-1} \ldots u_1 u_0$ to all addresses of the form
$v_{q-1} \ldots v_1 v_0 u_{p-1} \ldots u_1 u_0$. This is done in $q$ stages, starting
with the higher dimension, using the greedy algorithm. In
the general case of sbm_route, we have to replicate a number
of smaller sequences. First, round upwards to the closest
power of 2 the length of each such subsequence, and spread
the sequences such that each sequence of length m starts
at an address divisible by m. Next, perform in parallel all
replications, as described above. □
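
The address pattern used in the cartesian-product case of the proof can be made concrete with a short Python sketch. It computes only the set of destination rows for a packet, not the greedy routing itself; the function name `replication_targets` is hypothetical, and the sketch assumes the reading in which the low p address bits are kept and the high q bits range over all values.

```python
def replication_targets(u, p, q):
    """Destination rows for the packet starting in row u (0 <= u < 2**p).

    The low p bits (u) are kept and the high q bits v range over all values,
    giving the addresses v_{q-1}...v_0 u_{p-1}...u_0.
    """
    if not 0 <= u < (1 << p):
        raise ValueError("u must fit in p bits")
    return [(v << p) | u for v in range(1 << q)]
```

Taken over all `u`, the destinations cover each of the `2**(p+q)` rows exactly once, which is why the replication fills the whole butterfly without collisions at the outputs.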

When the number n of available processors is less than
the number W of elements in an array, we group
adjacent elements of the array in the same processor. The
above proposition can be extended to this case: an instruction
of complexity W can be implemented on a butterfly
network in $O(\frac{W}{n} \log n)$ steps.

3 The Source: The Nested Sequence Calculus (NSC)

We use types to explain the structure of NSC and classify
its features. The types are given by the grammar
$t ::= unit \mid \mathbb{N} \mid t \times t \mid t + t \mid [t]$. unit has exactly one value: the empty
tuple (). $\mathbb{N}$ is the type of nonnegative integers. The values
of the product type $s \times t$ are pairs $(x, y)$, with $x \in s$, $y \in t$.
$[t]$ is the type of finite sequences over $t$: it contains all
sequences $[x_0, \ldots, x_{n-1}]$, with $n \geq 0$ and $x_0, \ldots, x_{n-1} \in t$.
$s + t$ is the disjoint union type of $s$ and $t$; its values are
of the form $in_1(x)$ with $x \in s$ and $in_2(y)$ with $y \in t$. We
define the boolean type $B \stackrel{def}{=} unit + unit$, and identify its
values $in_1(())$ and $in_2(())$ with true and false respectively.
Extending the list of built-in types with reals, strings, etc.,
can be done while preserving all results.


The primitives of NSC are chosen to be operations naturally
associated to its types. Its expressions belong to
one of two distinct syntactic categories: terms, denoted
by M, N, P, U, V, etc., which have some type t, and functions,
denoted by F, G, etc., which have two associated types, the
domain s and codomain t. By abuse of language we say
in this case that the "type" of the function F is $s \to t$.
However $s \to t$ is not a type per se, which makes constructs
like $s \to (t_1 \to t_2)$ or $(s_1 \to s_2) \to t$ impossible. Both terms
and functions may contain free variables. The constructs are
as follows (see appendix A for a full and formal description
of the language):

• Variables $x$, error, constants $n$ (where $n \in \mathbb{N}$),
arithmetic operations $M \; op \; N$, where $op \in \Sigma$ (recall
from section 2 that $\Sigma = \{+, \dot{-}, \ldots\}$), and equality
$M = N$.

• Constructs associated with the product type: $()$, $\pi_1$, $\pi_2$,
$(M, N)$. Here $()$ denotes the empty tuple, $(M, N)$ is a
pair, while $\pi_1, \pi_2$ are the projections, with the meaning
$\pi_1(x, y) = x$, $\pi_2(x, y) = y$.

• Constructs associated with the sum type: $in_1(M)$,
$in_2(N)$, and case $M$ of $in_1(x) \Rightarrow N \mid in_2(y) \Rightarrow P$.
The latter is defined to be equal to $N[U/x]$ when $M = in_1(U)$,
and respectively to $P[V/y]$ when $M = in_2(V)$.

• Constructs associated with functions: $\lambda x : s.M$ and
$F(M)$. The former is called a lambda abstraction, and
is a function (as opposed to a term), of type $s \to t$,
provided that M is a term of type t. The second
construct, $F(M)$, is a term called function application
having type t, provided that F is some function of type
$s \to t$, and M is a term of type s. Although the type
s is part of the syntax of $\lambda x : s.M$, we shall drop it
when it is clear from the context. Note that $\lambda x : s.F$,
where F is a function, is not a legal construct in NSC,
nor is $\lambda x : s \to t.M$, i.e. no higher order functions are
allowed.

• Iteration: $while(P, F)$ is a function of type $t \to t$,
provided that P and F are functions of type $t \to B$
and $t \to t$ respectively.

• Constructs associated with collections (these constructs
work on sequences but also make sense for other kinds
of collections, like sets and bags [BBW92]): $[]$, $[M]$,
$M @ N$, $flatten(M)$, $length(M)$, $get(M)$, and $map(F)$.
Here $[]$ denotes the empty sequence, $[M]$ is the singleton
sequence, and @ is the append operator. Next,
$flatten([x_0, \ldots, x_{n-1}]) \stackrel{def}{=} x_0 @ x_1 @ \ldots @ x_{n-1}$, and
$length(M)$ returns the length of a sequence. get is
defined by $get([x]) \stackrel{def}{=} x$, $get([]) \stackrel{def}{=} get([x_0, x_1, \ldots]) \stackrel{def}{=}$ error.
Finally $map(F)$ is a function of type $[s] \to [t]$, provided
that F is a function of type $s \to t$. Its meaning
is: $map(F)([x_0, \ldots, x_{n-1}]) \stackrel{def}{=} [F(x_0), \ldots, F(x_{n-1})]$.

• Constructs associated only to sequences, and not to
other kinds of collections: $zip(M, N)$, $enumerate(M)$,
and $split(M, N)$. The meanings are:
$zip([x_0, \ldots, x_{n-1}], [y_0, \ldots, y_{n-1}]) \stackrel{def}{=} [(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})]$
(zip is undefined if its two arguments have different lengths),
$enumerate([x_0, \ldots, x_{n-1}]) \stackrel{def}{=} [0, \ldots, n - 1]$. Finally
$split(M, N)$ splits M according to the numbers contained
in N; e.g. $split([a, b, c, d, e, f], [3, 0, 1, 0, 2]) \stackrel{def}{=}
[[a, b, c], [], [d], [], [e, f]]$. It is defined only if the sum of the
elements in N equals the length of M.
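
The sequence constructs above have direct sequential counterparts, sketched below in Python. This is only an executable reading of the stated meanings, not part of the calculus: the names `nsc_zip`, `enumerate_seq` and `nsc_map` are chosen here to avoid clashing with Python built-ins, and the sketch assumes that `get` is an error on anything other than a singleton sequence.

```python
def flatten(xss):
    """flatten([x0, ..., xn-1]) = x0 @ x1 @ ... @ xn-1."""
    return [x for xs in xss for x in xs]


def get(xs):
    """Extract the element of a singleton; anything else is treated as an error."""
    if len(xs) != 1:
        raise ValueError("get: argument is not a singleton sequence")
    return xs[0]


def nsc_map(f):
    """map(F): apply f to every element (conceptually in parallel)."""
    return lambda xs: [f(x) for x in xs]


def nsc_zip(xs, ys):
    """Pair up elements; undefined for sequences of different lengths."""
    if len(xs) != len(ys):
        raise ValueError("zip: arguments have different lengths")
    return list(zip(xs, ys))


def enumerate_seq(xs):
    """enumerate([x0, ..., xn-1]) = [0, ..., n-1]."""
    return list(range(len(xs)))


def split(xs, ns):
    """Cut xs into consecutive pieces whose lengths are given by ns."""
    if sum(ns) != len(xs):
        raise ValueError("split: lengths do not sum to len(xs)")
    out, pos = [], 0
    for n in ns:
        out.append(xs[pos:pos + n])
        pos += n
    return out
```

Note that `flatten(split(xs, ns))` returns `xs` whenever `split` is defined, which is the sense in which a pair of a flat sequence and a list of lengths represents a nested sequence.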

Note that any function in NSC can, in a fixed amount of
time, only increase the size of its input by some polynomial.
Had we introduced as a primitive in the language something
like $i(n) \stackrel{def}{=} [0, \ldots, n - 1]$, which generates an arbitrarily long
list out of a number, this property would fail. From this
small set of primitives, we can derive a rich set of functions.
Some examples:

Database projections. $\Pi_i : [t_1 \times t_2] \to [t_i]$, $\Pi_i \stackrel{def}{=} map(\pi_i)$.

Conditionals. if $x = y$ then $M$ else $N$ is expressed by
case $(x = y)$ of $in_1(u) \Rightarrow M \mid in_2(v) \Rightarrow N$, where $u, v$ are
variables of type unit, not occurring in $M, N$.

Broadcasting. $\rho_2(x, [y_0, \ldots, y_{n-1}])$, which is defined to
be $[(x, y_0), \ldots, (x, y_{n-1})]$, can be expressed as $\rho_2(x, y) =
map(\lambda v.(x, v))(y)$ and has the type $\rho_2 : s \times [t] \to [s \times t]$.
When x itself is a sequence, $\rho_2(x, y)$ essentially computes the
cartesian product of x and y. (The name $\rho_2$ is motivated by
other considerations [BBW92].)

Bounded monotone routing. $bm\_route((u, d), x)$, expressed
as $\Pi_1(flatten(map(\rho_2)(zip(x, split(u, d)))))$, has type
$bm\_route : ([s] \times [\mathbb{N}]) \times [t] \to [t]$. bm_route is essentially the
same operation as described in section 2. E.g.
$bm\_route(([u_0, u_1, u_2, v_0, v_1], [3, 0, 2]), [a, b, c]) = [a, a, a, c, c]$. The bound
u prohibits us from constructing a very long sequence in
constant parallel time. An unbounded monotone routing
$m\_route : [\mathbb{N}] \times [t] \to [t]$ can be defined in NSC (with while),
but requires more than a constant number of parallel steps.
This is indeed necessary, since $m\_route([n], [a])$ produces
the sequence $[a, a, \ldots, a]$ of length n, whose size is not polynomially
bounded by the input. Finally, note that in the
context of nested sequences, our bounded monotone routing
is not truly "monotone". Indeed, $bm\_route(([(), ()], [2]),
[[a, b, c]]) = [[a, b, c], [a, b, c]]$, and the relative order of a, b
and c has not been preserved. This forces us to introduce
sbm_route in the BVRAM model.
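
The derived form of bounded monotone routing can be checked with a small Python sketch that follows the expression given above step by step: split the bound, zip it with the data, broadcast, flatten, and project. The helper names `split` and `rho2` are illustrative stand-ins for the calculus operations, and the code is a sequential reading only.

```python
def split(xs, ns):
    """Cut xs into consecutive pieces whose lengths are given by ns."""
    if sum(ns) != len(xs):
        raise ValueError("split: lengths do not sum to len(xs)")
    out, pos = [], 0
    for n in ns:
        out.append(xs[pos:pos + n])
        pos += n
    return out


def rho2(x, ys):
    """Broadcast x over ys: [(x, y0), ..., (x, yn-1)]."""
    return [(x, y) for y in ys]


def bm_route(u, d, x):
    """bm_route((u, d), x): replicate x[i] once per element of the i-th piece of u."""
    pieces = split(u, d)
    if len(x) != len(pieces):
        raise ValueError("zip: arguments have different lengths")
    broadcast = [rho2(xi, piece) for xi, piece in zip(x, pieces)]  # map(rho2)(zip(...))
    flat = [pair for pairs in broadcast for pair in pairs]          # flatten
    return [first for first, _ in flat]                             # first projection
```

The result length always equals `len(u)`, which is how the bound `u` keeps the output polynomially related to the input.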

Selections. $\sigma_1 : [s + t] \to [s]$, $\sigma_2 : [s + t] \to [t]$. $\sigma_1(x)$ selects
from some sequence x only those elements which have
the form $in_1(u)$, while $\sigma_2(x)$ selects only the elements of the
form $in_2(u)$. E.g. if $x = [in_1(a), in_2(b), in_2(c), in_2(d), in_1(e),
in_2(f)]$, then $\sigma_1(x) = [a, e]$, $\sigma_2(x) = [b, c, d, f]$. $\sigma_1$ is defined
by $\sigma_1(x) \stackrel{def}{=} flatten(map(\lambda u. \text{case } u \text{ of } in_1(u') \Rightarrow [u'] \mid in_2(u'') \Rightarrow
[])(x))$, and $\sigma_2$ is defined similarly.
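
A minimal Python sketch of the two selections follows. The encoding of sum-type values as tagged pairs `("in1", v)` and `("in2", v)` is an assumption made here for illustration; the structure of the code follows the definition above, mapping each element to a singleton or empty sequence and then flattening.

```python
def flatten(xss):
    return [x for xs in xss for x in xs]


def sigma1(xs):
    """Keep the payloads of the elements tagged in1."""
    return flatten([[v] if tag == "in1" else [] for tag, v in xs])


def sigma2(xs):
    """Keep the payloads of the elements tagged in2."""
    return flatten([[v] if tag == "in2" else [] for tag, v in xs])
```

Because every element is handled independently before the flatten, the selection takes a constant number of parallel steps in the calculus.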

Operations on lists. first and tail can be defined by:

\[ first(x) \stackrel{def}{=} get(get(bm\_route(([()], [1, 0]), split(x, [1, length(x) \dot{-} 1])))) \]
\[ tail(x) \stackrel{def}{=} get(bm\_route(([()], [0, 1]), split(x, [1, length(x) \dot{-} 1]))) \]

If x is empty, split will produce an error. Similarly we can define
last and remove_last, which return the last element, and
delete the last element from a sequence, respectively. In general,
we can access any element of a sequence of length
n in $O(1)$ parallel time, and with $O(n)$ work complexity (we
formally define below the time and work complexity). Using
map, we can produce an arbitrary permutation in $O(1)$ parallel
time, but with an increase of the work complexity to
$O(n^2)$. Using radix sort in base $n^\epsilon$, for some arbitrary $\epsilon > 0$,
we can even compute an arbitrary permutation in $O(1)$ parallel
time with $O(n^{1+\epsilon})$ work complexity. Alternatively, we
can use an optimal sorting algorithm (see e.g. [Jaj92]), which
reduces the work complexity to $O(n)$ by increasing the time
complexity (e.g. the sorting algorithm described in section 5
has $T = O(\log n \log \log n)$). Thus, the cost of performing an
arbitrary permutation is visible in the higher level language.
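
One way to read the quadratic-work permutation is sketched below in Python: every output position scans all (target, value) pairs and keeps the one addressed to it. This is an illustration under assumptions, not the authors' construction; the name `permute_by_map` and the convention that `perm[i]` is the destination of `xs[i]` are chosen here. The returned counter shows where the quadratic work comes from.

```python
def permute_by_map(xs, perm):
    """Place xs[i] at position perm[i]; return (result, number of comparisons).

    Each of the n output positions corresponds to one instance of a map and
    inspects all n pairs, so the total number of comparisons is n * n.
    """
    pairs = list(zip(perm, xs))
    comparisons = 0
    out = []
    for j in range(len(xs)):  # conceptually one parallel map instance per j
        hits = []
        for target, x in pairs:
            comparisons += 1
            if target == j:
                hits.append(x)
        if len(hits) != 1:
            raise ValueError("perm is not a permutation")
        out.append(hits[0])
    return out, comparisons
```

All n scans are independent of each other, which is why the parallel time stays constant while the work grows quadratically.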

The compilation theorem 7.1 is robust enough to hold if
NSC is extended with additional primitives, like a general
permutation permute or scan operations, provided that corresponding
instructions are added to the BVRAM model.
E.g. theorem 7.1 can be extended to prove that NSC +
permute can be efficiently compiled into BVRAM + permute.
But in its present form theorem 7.1 is stronger, because it
proves that a general permutation is not necessary in a BVRAM
in order to compile efficiently a high-level language like NSC.
This is of importance in view of the high cost of implementing
a general permutation on existing massively parallel architectures
[KLGLS90].

As promised, we will give a high-level definition of parallel
time complexity T and work complexity W for NSC
programs, in a machine independent way. The idea is for
the parallel complexity of a program to be inferred from
its structure in the same way in which the sequential complexity
is inferred from the structure of a program in a
sequential language. In our case, all primitive operations
(including @ and flatten) take one parallel step, while in
a $map(F)([x_0, \ldots, x_{n-1}])$, the n executions of F are done
in parallel. The iteration construct, however, may count for
several steps, hence our definition cannot be done solely by
induction on programs. This is handled by providing a formal
operational semantics and then counting the depth of
derivations in it. The work complexity is tied to the size of
the data that is being manipulated.

Formally, we start by defining S-objects by the grammar:
$C ::= () \mid n \mid (C, C) \mid in_1(C) \mid in_2(C) \mid [C, \ldots, C]$,
where $n \in \mathbb{N}$. We only consider typed S-objects. We
adopt a unit size complexity measure, and define the size
of an S-object by $size(()) = size(n) = 1$, $size((C, D)) =
1 + size(C) + size(D)$, $size(in_1(C)) = size(in_2(C)) = 1 +
size(C)$, $size([C_0, \ldots, C_{n-1}]) = 1 + \sum_{i=0,n-1} size(C_i)$. We
use true and false as abbreviations for $in_1(())$ and $in_2(())$.
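
The size measure can be transcribed into a short recursive function. The Python encoding of S-objects used below is an assumption for illustration: the empty tuple `()` for the unit value, `int` for numbers, 2-tuples for pairs, lists for sequences, and two small wrapper classes `In1` and `In2` for the injections.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class In1:
    value: object


@dataclass(frozen=True)
class In2:
    value: object


def size(c):
    """Unit-size measure of an S-object, following the equations in the text."""
    if isinstance(c, int):
        return 1
    if isinstance(c, tuple):
        if len(c) == 0:          # the empty tuple ()
            return 1
        if len(c) == 2:          # a pair (C, D)
            return 1 + size(c[0]) + size(c[1])
        raise TypeError("only the empty tuple and pairs are S-objects")
    if isinstance(c, (In1, In2)):
        return 1 + size(c.value)
    if isinstance(c, list):
        return 1 + sum(size(x) for x in c)
    raise TypeError(f"not an S-object: {c!r}")
```

For example, the nested sequence `[[3, 5], [2]]` has size 1 + (1 + 2) + (1 + 1) = 6, and the boolean `true`, encoded as `In1(())`, has size 2.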

Next, we define the evaluation of a term (also called
the operational semantics) in a natural semantics style, as
in [Kah87]. This consists of rules which simultaneously define
a binary relation $M \Downarrow C$, meaning that the term M evaluates
to the S-object C, and a ternary relation $F(C) \Downarrow C'$,
meaning that the function F applied to the S-object C evaluates
to C'. E.g. if $F = \lambda x. flatten(x) @ [100]$ and $C =
[[3, 5], [2]]$, then $F(C) \Downarrow [3, 5, 2, 100]$. Some representative
rules are:


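\[
\frac{M \Downarrow m \quad N \Downarrow n}{M + N \Downarrow m + n}
\qquad
\frac{M \Downarrow (C, D)}{\pi_1(M) \Downarrow C}
\qquad
\frac{M \Downarrow (C, D)}{\pi_2(M) \Downarrow D}
\]

\[
\frac{M \Downarrow C \quad N \Downarrow D}{(M, N) \Downarrow (C, D)}
\qquad
\frac{M \Downarrow C \quad F(C) \Downarrow D}{F(M) \Downarrow D}
\]

\[
\frac{M \Downarrow [C_0, \ldots, C_{m-1}] \quad N \Downarrow [D_0, \ldots, D_{n-1}]}
     {M @ N \Downarrow [C_0, \ldots, C_{m-1}, D_0, \ldots, D_{n-1}]}
\]

\[
\frac{F(C_0) \Downarrow D_0 \quad \ldots \quad F(C_{n-1}) \Downarrow D_{n-1}}
     {map(F)([C_0, \ldots, C_{n-1}]) \Downarrow [D_0, \ldots, D_{n-1}]}
\]

\[
\frac{P(C) \Downarrow false}{while(P, F)(C) \Downarrow C}
\qquad
\frac{P(C) \Downarrow true \quad F(C) \Downarrow C' \quad while(P, F)(C') \Downarrow D}
     {while(P, F)(C) \Downarrow D}
\]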

The complete set of rules is given in appendix B, where
we explain a technical complication caused by the presence
of bound variables (lambda abstraction) in the language,
namely the need to use environments as in [Cur88].

Thus, to evaluate some closed term M, one has to construct
a proof tree, whose nodes are labeled with rules of
the operational semantics, such that its root is labeled with
some rule with conclusion $M \Downarrow C$. Based on this operational
semantics, we now define the time and work complexity of
NSC in a machine independent way.

Definition 3.1 Consider some NSC term M. The time
and work complexity $T(M)$, $W(M)$ of $M \Downarrow C$ are defined
by induction on the proof of $M \Downarrow C$. The induction
is done simultaneously with the definition of the time and
work complexity $T(F, C)$ and $W(F, C)$ of some evaluation
$F(C) \Downarrow D$, where F is an NSC function, and C, D are S-objects.
Except for the rules for map and while, for every
rule of the form:

\[
\frac{M_1 \Downarrow C_1, \ldots, M_n \Downarrow C_n}{M \Downarrow C}
\]

we define:

\[
T(M) \stackrel{def}{=} 1 + \sum_i T(M_i)
\qquad
W(M) \stackrel{def}{=} SIZE + \sum_i W(M_i)
\]

where SIZE is the total size of all S-objects mentioned in the
rule (in the premises and the conclusion, including the environments).
For the map-rule, the definition of W remains
the same, while the definition of T becomes:

\[
T(M) \stackrel{def}{=} 1 + \max_{i=1,n} T(M_i)
\]

(this corresponds to the fact that the function is applied in
parallel on all objects in the sequence). For the while rule we
do not include in SIZE the size of the output D (otherwise,
the final output D of while would be counted as many times
as there are iterations performed by while). More precisely,
if the last rule of $while(P, F)(C) \Downarrow D$ was:

\[
\frac{P(C) \Downarrow true \quad F(C) \Downarrow C' \quad while(P, F)(C') \Downarrow D}
     {while(P, F)(C) \Downarrow D}
\]

then we define:

\[
T(while(P, F), C) \stackrel{def}{=} 1 + T(P, C) + T(F, C) + T(while(P, F), C')
\]

\[
W(while(P, F), C) \stackrel{def}{=} size(C) + size(C') + W(P, C) + W(F, C) + W(while(P, F), C')
\]

(i.e. $size(D)$ is not included explicitly in $W(while(P, F), C)$).
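
The two special cases of Definition 3.1 can be sketched as cost-accounting helpers in Python. This is an illustration under stated assumptions, not the paper's formal definition: functions are modelled as Python callables returning a triple `(value, T, W)`, the caller supplies the SIZE term for the map rule, and the cost of the terminating while step (where P evaluates to false) is assumed here to be `1 + T(P, C)` time and `size(C) + W(P, C)` work, since the text spells out only the recursive case.

```python
def map_cost(element_costs, size_term):
    """Cost of one application of the map rule.

    element_costs: list of (T_i, W_i), one per evaluation F(C_i).
    size_term: the SIZE of the S-objects mentioned in the rule.
    """
    t = 1 + max((t_i for t_i, _ in element_costs), default=0)  # parallel: max, not sum
    w = size_term + sum(w_i for _, w_i in element_costs)       # work still adds up
    return t, w


def while_cost(p, f, c, size):
    """Evaluate while(P, F)(C) and return (result, T, W).

    p(c) -> (bool, T, W) and f(c) -> (value, T, W). The recursion of the
    definition is unrolled into a loop; the totals are the same.
    """
    total_t = total_w = 0
    while True:
        cond, tp, wp = p(c)
        if not cond:
            # Assumed base case: one step plus the cost of the test itself.
            return c, total_t + 1 + tp, total_w + size(c) + wp
        c_next, tf, wf = f(c)
        total_t += 1 + tp + tf
        total_w += size(c) + size(c_next) + wp + wf  # size(D) is never added
        c = c_next
```

As a usage example, take a test `n > 0` and a body `n // 2`, each costing T = 1 and W = 2, on S-objects of size 1. Starting from 4 the loop runs three iterations, giving T = 3 * 3 + 2 = 11 and W = 3 * 6 + 3 = 21.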




The language NSC together with its notions of time and
work complexity is a model of parallel computation in its
own right, but parallel algorithms are most commonly given
in terms of one of the several known flavors of PRAM. To
facilitate comparisons, we offer the following efficient simulation
(NSC's version of Brent's scheduling principle, as it
were):


Proposition 3.2 Any NSC function of time complexity T
and work complexity W can be simulated on a CREW PRAM
with scan primitives using p processors with asymptotic complexity
$O(T + W/p)$.

Proof. (Sketch) Given some function f in NSC, first
flatten f for an extended version of a BVRAM, with unboundedly
many vector registers and indirect addressing (essentially
the VRAM of [Ble90], but with the communication
primitives described in section 2). The resulting extended-BVRAM
program has the same time and work complexity
as f: see remark 7.3. Next use the simulation of an extended
BVRAM on a CREW with scan primitives, in the
spirit of [Ble90]. We need a CREW instead of an EREW in
order to simulate bm_route and sbm_route. □
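The bound O(T + W/p) rests on the usual counting argument behind Brent's scheduling principle. The Python sketch below illustrates only that counting argument, not the simulation used in the proof; the function names and the per-step work profile are hypothetical.

```python
from math import ceil

def scheduled_steps(work_per_step, p):
    """Steps needed on p processors when step i of the original
    computation performs work_per_step[i] unit operations in parallel."""
    return sum(ceil(w / p) for w in work_per_step)

def brent_bound(work_per_step, p):
    """The bound T + W/p, with T the number of steps and W the total work."""
    T = len(work_per_step)
    W = sum(work_per_step)
    return T + W / p
```

Each original step with w operations costs at most w/p + 1 rounds on p processors, so summing over the T steps gives at most W/p + T rounds.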

4 Expressing map-recursive functions in NSC

Although it is described in a concise, mathematical style
(notice that we called it a “calculus” rather than a “language”),
NSC can be easily extended to a more user-friendly
language by allowing a certain amount of block structure:
definitions of global/local variables and of nonrecursive functions.
There is a straightforward translation of such an extension
back into NSC, which we omit from this extended
abstract. Accommodating recursive functions, though, is a
more delicate problem, which we address here.

Consider  the  following  limited  form  of  recursion: 

Definition  4.1  A  function  definition  is  map-recursive  if 
it  has  the  form 

fun f(x) = c(x, map(f)(d(x)))

First, it is easy for a compiler to check whether a recursive
definition is of this form (in contrast, containment [Ble90]
is an undecidable property). Second, this form is general
enough to express many existing parallel algorithms: tail-recursive
definitions, and what is usually meant by divide-and-conquer
recursion (for instance the worked example in section 5),
are map-recursive. Here are some recursion schemata
and a sketch of how to convert them into map-recursive form
(and in the process “parallelize” them):

fun g(x) = if p(x) then s(x) else c(g(d1(x)), g(d2(x)))

fun h(x) = if p(x) then s(x) else c(h(d(x)))

fun k(x) = if p(x) then s(x) else
             if p'(x) then c(k(d1(x)), k(d2(x)))
             else c'(k(d1'(x)), k(d2'(x)), k(d3'(x)))

For g, we construct a list of length 2, and recursively map
g on it (Quicksort has this form). For h, the list will have
length 1 (tail recursion is a special case of this form). The
function k is more interesting, since it divides its input into
either two or three subproblems. Note that it is not contained [Ble90],
so the compilation techniques described here work on some
cases on which those of [Ble90] do not. In converting k, the
list will have length 1, 3 or 4, where the first element is a
tag, and k is slightly modified to return the identity on the
tag (a sum of types is used here).
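As an illustration of the conversion for the schema g, here is a minimal Python sketch. It is not the paper's construction: the helper names `map_recursive` and `schema_g` are hypothetical, and a leaf is signalled by an empty list of subproblems instead of by tagging.

```python
from heapq import merge

def map_recursive(c, d):
    """Build f with f(x) = c(x, map(f)(d(x)))."""
    def f(x):
        return c(x, [f(y) for y in d(x)])
    return f

def schema_g(p, s, d1, d2, combine):
    """Map-recursive form of
    g(x) = if p(x) then s(x) else combine(g(d1(x)), g(d2(x)))."""
    def d(x):
        # No subproblems at a leaf, a list of length 2 otherwise.
        return [] if p(x) else [d1(x), d2(x)]
    def c(x, results):
        return s(x) if not results else combine(results[0], results[1])
    return map_recursive(c, d)

# Hypothetical instance: a mergesort with the recursion schema of g.
mergesort = schema_g(
    p=lambda xs: len(xs) <= 1,
    s=lambda xs: list(xs),
    d1=lambda xs: xs[:len(xs) // 2],
    d2=lambda xs: xs[len(xs) // 2:],
    combine=lambda a, b: list(merge(a, b)),
)
```

The list comprehension inside `map_recursive` is the only place where recursion happens, which is what makes the two recursive calls candidates for parallel evaluation.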

The first of our two main results states that map-recursion
can be translated (in a source-to-source manner) into an NSC
expression, while preserving its time complexity and “almost”
preserving its work complexity.

Theorem 4.2 Consider some function f defined in NSC
extended with map-recursion, with time and work complexity
T, W. Then, for any ε > 0, one can construct a function
f' in NSC which is equivalent to f and which has time
and work complexity T' = O(T) and $W' = O(W^{1+\varepsilon})$ respectively.
Moreover, if the divide and conquer tree of f is
balanced, then W' = O(W).

Proof. For illustration, we consider only the function g
from above. Suppose the types are: g : s → t, d1, d2 : s → s,
and c : t × t → t. Not surprisingly, g can be expressed in
NSC, without recursion, in two steps, called divide phase
and combine phase in [MH88]:

Divide Phase Start with the singleton sequence y = [x]
of type [s], and apply repeatedly the function flatten
∘ map(λx. if p(x) then [x] else [d1(x), d2(x)]), having
the type [s] → [s], until all its elements satisfy the
predicate p. (We need to tag the elements resulting
from [x], to avoid applying p repeatedly on them; we
omit the details.) Call y the resulting sequence.

Combine Phase Start by mapping the function s on y,
and then apply repeatedly c to adjacent elements of
y; some additional bookkeeping is necessary to make
sure  c  is  applied  to  the  correct  pairs  (e.g.,  it  suffices  to 
store  the  depth  in  the  divide  and  conquer  tree  for  each 
element  in  y,  and  only  combine  adjacent  elements  if 
they  have  the  same  depth).  Stop  when  there  is  only 
one  element  in  the  resulting  list. 
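The two phases can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the NSC translation itself: elements carry their depth as a pair, the predicate p is simply re-evaluated each round instead of using tags, and each combine round pairs only the elements at the current maximum depth, which is one way to make "adjacent elements with the same depth" unambiguous.

```python
def divide_phase(x, p, d1, d2):
    """Repeat flatten . map(split-or-keep) until every element satisfies p."""
    y = [(x, 0)]  # (element, depth in the divide and conquer tree)
    while not all(p(v) for v, _ in y):
        y = [e
             for v, dep in y
             for e in ([(v, dep)] if p(v)
                       else [(d1(v), dep + 1), (d2(v), dep + 1)])]
    return y

def combine_phase(y, s, c):
    """Map s on the leaves, then combine sibling pairs bottom-up."""
    z = [(s(v), dep) for v, dep in y]
    while len(z) > 1:
        deepest = max(dep for _, dep in z)
        out, i = [], 0
        while i < len(z):
            v, dep = z[i]
            if dep == deepest:
                # At the maximum depth, siblings are always adjacent.
                w, _ = z[i + 1]
                out.append((c(v, w), dep - 1))
                i += 2
            else:
                out.append((v, dep))
                i += 1
        z = out
    return z[0][0]
```

The number of rounds in each phase equals the depth of the divide and conquer tree, while leaves that are reached early stay in the sequence and are touched again in every later round.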

Obviously, the translated g will have time complexity
O(T). The work complexity is also preserved in the case in
which the divide and conquer tree for the computation of g(x)
is perfectly balanced. When the tree is unbalanced, the leaves
which are reached sooner have to coexist in the same sequence
with those nodes which need more divide steps, thus
adding to the total work complexity. Let ν be the number of
different levels in the divide and conquer tree which contain
leaves. E.g. in an almost perfectly balanced tree, ν = 1 or
ν = 2, while in a totally “unbalanced” tree, ν can be equal to
the total number of leaves, but still ν ≤ W(g, x). We can
compute ν in time and work complexity O(T), O(W), by
simulating only the divide phase, without retaining the results.
Let ε > 0. We improve the divide phase such that the
time and work complexities of the translation of g into NSC
become O(T) and $O(\nu^{\varepsilon} W)$ respectively. Namely, we start
with $1/\varepsilon + 1$ variables $z_i$, $i = 0, \dots, 1/\varepsilon$, initialized to the
empty sequence [], and with y initialized to the singleton [x].
We apply repeatedly the divide phase on y; whenever some
leaves are reached, we move them into $z_0$. We only allow $z_0$
to be touched $\nu^{\varepsilon}$ times, after which we move its entire
content into $z_1$ and empty $z_0$. We repeat this process, but
only allow $z_1$ to be touched $\nu^{\varepsilon}$ times, at which point we
empty $z_1$ by moving everything into $z_2$. In general, we allow
$z_i$ to accumulate only $\nu^{\varepsilon}$ times, after which we empty it by
moving everything into $z_{i+1}$. Obviously, a number $\nu^{i\varepsilon}$ of
levels of leaves must be discovered before making one move
into $z_i$; thus, $z_{1/\varepsilon}$ will be filled exactly once, with the leaves
from all ν levels. To compute the total additional work
complexity, observe that each leaf travels exactly once through
$z_0, z_1, \dots, z_{1/\varepsilon}$, and in each $z_i$ is “touched” exactly $\nu^{\varepsilon}$
times. Thus, the total work complexity is bounded by
$(1/\varepsilon + 1)\,\nu^{\varepsilon} W = O(\nu^{\varepsilon} W)$.
Of course, rather complicated bookkeeping is necessary to
keep all elements in $z_i$ sorted. The combine phase is done
similarly, but in reverse. □
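The effect of the cascade of buffers can be checked by counting touches. The Python sketch below is a simplified model, not the paper's bookkeeping: it assumes that appending to a sequence touches all of its current elements, it represents each level of leaves only by its size, and the names `naive_touches` and `buffered_touches` are hypothetical. Here b plays the role of ν^ε and r the role of 1/ε.

```python
def naive_touches(batches):
    """All leaves kept in one sequence: every append touches everything."""
    size = work = 0
    for n in batches:
        size += n
        work += size
    return work

def buffered_touches(batches, b, r):
    """Leaves go through buffers z[0..r]; z[i] is emptied into z[i+1]
    after it has been appended to b times (z[r] is never emptied)."""
    size = [0] * (r + 1)      # current number of leaves in each buffer
    appends = [0] * (r + 1)   # appends since the buffer was last emptied
    work = 0
    for n in batches:
        size[0] += n
        appends[0] += 1
        work += size[0]
        i = 0
        while i < r and appends[i] == b:
            size[i + 1] += size[i]
            appends[i + 1] += 1
            work += size[i + 1]
            size[i] = 0
            appends[i] = 0
            i += 1
    # Final flush: push whatever is left up to z[r].
    for i in range(r):
        size[i + 1] += size[i]
        work += size[i + 1]
        size[i] = 0
    return work
```

When b^r is at least the number of batches, each leaf is touched at most b times in every buffer below z[r] and a constant number of times afterwards, so the total stays within (r + 1)(b + 1) times the number of leaves, whereas the single-sequence version grows quadratically in the number of levels.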


The technique of theorem 4.2 seems to extend to more
general recursion schemas than map-recursion. The
main kind of recursion to which this technique does not apply
is one in which some recursive call to f uses an argument
which is itself computed with a recursive call, in the style of
the Ackermann function: A(x, y) = A(x - 1, A(x, y - 1)). We
argue that very few practical algorithms actually make use of
such recursion schemas.

5 An O(log n log log n) Mergesort Algorithm Expressed
in NSC

As evidence for the practical expressiveness of NSC we describe
in it Valiant's fast mergesort algorithm [Val75, Jaj92];
see the program in figures 1, 2, 3. As we have explained at
the beginning of section 4, we are free to use block structure
(we choose a syntax close to ML [MTH90]). More importantly,
in view of theorem 4.2 we are free to use map-recursive
definitions, or other recursive schemas which are
convertible to map-recursion. The main function mergesort
in figure 1 has the same recursion schema as the function g
of section 4 and hence can be converted to a map-recursive
form. Its parallel time complexity is O(log n log log n).

The fast, O(log log m) time merge function exhibits a
more complicated kind of map-recursion. To merge two
sequences $A = [a_0, \dots, a_{m-1}]$, $B = [b_0, \dots, b_{n-1}]$, we divide
A into $\lceil\sqrt{m}\rceil$ subsequences of length $\le \sqrt{m}$; let
$AA = [A_0, \dots, A_{\lceil\sqrt{m}\rceil-1}]$ be the resulting nested sequence. Next,
we find for each subsequence $A_i$ the corresponding subsequence
$B_i$ in B, with which $A_i$ has to be merged, and
apply recursively merge on all pairs $(A_i, B_i)$; let
$BB = [B_0, \dots, B_{\lceil\sqrt{m}\rceil-1}]$. Thus, the general structure of merge is:

fun merge(A, B) =
  if length(A) < 2 then direct_merge(A, B)
  else let ...compute AA, BB as explained
       in flatten(map(merge)(zip(AA, BB))) end

which can obviously be translated into a map-recursion.
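The recursive structure of merge can be sketched sequentially in Python. This is only an illustration of the block decomposition: it finds each B_i by binary search (`bisect_left`) instead of the parallel ranking of figure 1, it uses blocks of length floor(sqrt(m)) so that the recursion always shrinks, and `direct_merge` is a hypothetical stand-in for the base case.

```python
from bisect import bisect_left, insort
from math import isqrt

def direct_merge(a, b):
    """Base case: a has at most one element."""
    out = list(b)
    for v in a:
        insort(out, v)
    return out

def merge(a, b):
    m = len(a)
    if m < 2:
        return direct_merge(a, b)
    step = isqrt(m)  # block length; ASSUMPTION: floor(sqrt(m))
    aa = [a[i:i + step] for i in range(0, m, step)]
    # cuts[i] = number of elements of b smaller than the first element
    # of aa[i]; the first block starts at 0 and the last ends at len(b).
    cuts = [0] + [bisect_left(b, blk[0]) for blk in aa[1:]] + [len(b)]
    bb = [b[cuts[i]:cuts[i + 1]] for i in range(len(aa))]
    # flatten(map(merge)(zip(AA, BB)))
    return [x for ai, bi in zip(aa, bb) for x in merge(ai, bi)]

def mergesort(a):
    if len(a) <= 1:
        return list(a)
    h = len(a) // 2
    return merge(mergesort(a[:len(a) - h]), mergesort(a[len(a) - h:]))
```

Because every pair (A_i, B_i) covers a disjoint value range, the recursive results can simply be concatenated, which is what makes the recursive calls independent.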

Figures 2 and 3 contain some auxiliary functions used in
merge. The function index(C, I) expects a sorted sequence
of indexes $I = [i_0, \dots, i_{k-1}]$ and, for $C = [C_0, \dots, C_{n-1}]$,
returns the sequence $[C_{i_0}, \dots, C_{i_{k-1}}]$; it has constant time
complexity and work complexity O(n + k). The function
index_split(C, I) splits C according to the indexes in I,
again provided that I is sorted, with similar time and work
complexity. We use the construct filter(P) : [t] → [t], which
for some predicate P : t → B returns the sequence of all
elements satisfying P. It is expressible in NSC by:

filter(P)(x) = flatten(map(λu. if P(u) then [u] else [])(x))
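The same definition reads as follows in Python, as a small sketch; `flatten` and `filter_seq` are hypothetical names for list-based counterparts of the NSC constructs.

```python
def flatten(xss):
    """Concatenate a sequence of sequences."""
    return [x for xs in xss for x in xs]

def filter_seq(pred):
    """filter(P) = flatten . map(lambda u: [u] if P(u) else [])."""
    return lambda xs: flatten([[u] if pred(u) else [] for u in xs])
```

Each element is mapped independently to a sequence of length 0 or 1, so the whole filter is a single map followed by a single flatten.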

The functions first, tail, last, remove_last and bm_route
are defined in section 3.

Using the techniques described in [Jaj92], the merge function
can be transformed to become optimal, i.e. to reduce its
work complexity from O((m + n) log log m) to O(m + n).
This also gives us an optimal (i.e. with O(n log n) work
complexity), O(log n log log n)-time sorting function. The
divide-and-conquer trees for both the sorting and the merging
function are balanced, hence the translation of theorem 4.2
gives us an optimal O(log n log log n)-time sorting
function in NSC.
fun mergesort(A) =
  if length(A) ≤ 1 then A
  else let val n = length(A)
           val AA = split(A, [n - n/2, n/2])
       in merge(mergesort(first(AA)),
                mergesort(last(AA)))
       end

fun merge(A, B) =
  if length(A) < 2 then direct_merge(A, B)
  else
    let val m = length(A)
        val n = length(B)
        val A' = sqrt_positions(A)
        val B' = sqrt_positions(B)
        (* A', B' have lengths √m and √n respectively *)
        val R' = direct_rank(A', B')
        val BB1 = sqrt_split(B)
        (* split B into √n blocks *)
        val a_B = zip(A', index(BB1, R'))
        (* group each a' with its block *)
        val RR' = map(rank_one)(a_B)
        (* rank each a' in its block *)
        val R = map(λ(x, y). (x - 1) * √n + y)(zip(R', RR'))
        val AA = sqrt_split(A)
        val BB = index_split(B, R)
    in flatten(map(merge)(zip(AA, BB)))
    end


Figure 1: Valiant's O(log n log log n) sorting algorithm.


fun index(C, I) =
  let val n = length(C)
      val k = length(I)
      val zero_to_k = enumerate(I)@[k]
      val delta_I = map(-)(zip(I@[n], [0]@I))
      val P = bm_route((C, delta_I), zero_to_k)
      val delta_P = map(-)(zip(P, remove_last([0]@P)))
  in bm_route((I, delta_P), C)
  end

fun index_split(C, I) =
  let val n = length(C)
  in split(C, map(-)(zip(I@[n], [0]@I)))
  end


Figure 3: The functions index and index_split.
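The functions index and index_split of figure 3 can be traced with a small Python sketch. It rests on an assumption about bm_route, whose definition (in section 3) is not reproduced here: `bm_route((u, d), v)` is taken to replicate each `v[j]` exactly `d[j]` times, with `u` only fixing the length of the result. The helpers `split` and `deltas` are likewise hypothetical list-based stand-ins.

```python
def bm_route(u_d, v):
    """ASSUMED semantics: replicate v[j] d[j] times; len(result) == len(u)."""
    u, d = u_d
    out = [x for x, reps in zip(v, d) for _ in range(reps)]
    assert len(out) == len(u)
    return out

def split(c, lengths):
    """Cut c into consecutive segments of the given lengths."""
    out, pos = [], 0
    for n in lengths:
        out.append(c[pos:pos + n])
        pos += n
    return out

def deltas(i, n):
    """map(-)(zip(I@[n], [0]@I)): gaps between consecutive indexes."""
    return [a - b for a, b in zip(i + [n], [0] + i)]

def index(c, i):
    n, k = len(c), len(i)
    zero_to_k = list(range(k)) + [k]
    delta_i = deltas(i, n)
    # p[pos] = number of indexes in i that are <= pos
    p = bm_route((c, delta_i), zero_to_k)
    # delta_p[pos] = number of times pos occurs in i
    delta_p = [a - b for a, b in zip(p, ([0] + p)[:-1])]
    return bm_route((i, delta_p), c)

def index_split(c, i):
    return split(c, deltas(i, len(c)))
```

Under this reading, index uses two routing steps and two elementwise subtractions and no loop over the indexes, which is consistent with the constant time complexity stated for it.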

6 Theoretical Expressive Power

In this section we give evidence that NSC is not too restrictive
as a tool for designing parallel algorithms. Namely,
let CRCW-TIME-PROC(T(n), P(n)) be the set of functions
computable on a CRCW PRAM in time T(n) using P(n)
processors, and NSC-TIME-WORK(T(n), W(n)) the set of
functions expressible in NSC with time and work complexity
T(n), W(n).

Proposition 6.1 For T(n), W(n) that are suitable (in the
sense of [SV84]), we have:

$$\text{CRCW-TIME-PROC}(O(T(n)), O(W(n))) \subseteq \text{NSC-TIME-WORK}(O(T(n)), W(n)^{O(1)})$$

Moreover, we get equality if in the definition of NSC we restrict
the arithmetic operations to the set Σ = {+, -}, and if we
replace the unit size complexity (size(n) = 1, see section 3)
with the logarithmic size complexity (size(n) = log n) in
the definition of the work complexity of NSC.

The proof uses a theorem in [SV84], credited to Ruzzo
and Tompa, relating CRCW PRAMs to Alternating Turing
Machines, and is omitted from this extended abstract. Using
the above proposition and proposition 3.2 we can establish
that NC coincides with the functions in NSC with polylogarithmic
time and polynomial work complexity. Recall that
NSC is parameterized by a set Σ of arithmetic operations.

Proposition 6.2 Suppose all arithmetic operations in Σ are
in NC. Then:

$$NC = \text{NSC-TIME-WORK}(\log^{O(1)} n,\ n^{O(1)})$$

7 Efficient Compilation of NSC to BVRAM

Theorem 7.1 (Compilation Theorem) For every function
f in NSC with time and work complexity T, W, there is
a BVRAM, M, such that: ∀ε > 0, there is some program P
for M, equivalent to f, having time complexity T' = O(T)
and $W' = O(W^{1+\varepsilon})$.

Figure 2: Auxiliary functions used in merge.


Note that, in contrast to theorem 4.2, the number of registers
only depends on f and not on ε. A while-construct
can be rewritten as a tail-recursive function, hence is contained,
according to the definition in [Ble90], and therefore
the compilation technique described there (for a VRAM
with unboundedly many vector registers) preserves its step
and work complexity. However, we cannot apply that compilation
technique here. Indeed, when viewed as a tail-recursive
function, the work complexity of while may increase
significantly, because the final result after iterating n steps
is touched n additional times, as the tail-recursive function
returns from its calls. In the definition of the work complexity
for while, these n additional touches are not counted
(see definition 3.1). So the tail-recursive translation has a
higher work complexity than the original while construct.
We need a stronger compilation technique in order to stay
within the lower work complexity. Moreover, we also only
have a bounded number of vector registers.

The proof goes through the following steps:

• Variable Elimination. We translate NSC into a rather
similar, but variable-free language called Nested Relational
Algebra, NSA. The new language only contains
functions f : s → t, i.e. no terms. Some term M in NSC,
of type t and with free variables $x_1 : s_1, \dots, x_n : s_n$,
will be translated into a function $f_M : s_1 \times \dots \times s_n \to t$
in NSA. The primitive functions and the constructs
in NSA correspond roughly to those in NSC, with only
one additional primitive: the function $\rho_2$ : s × [t] →
[s × t] (see section 3 for its definition). The step and
work complexity of functions expressed in NSC and
NSA are the same. We omit the description of NSA
from this extended abstract; it can be found in appendix C.

• Flattening. We define a language for flat sequences,
called Sequence Algebra, SA, and translate NSA into
SA. Namely, for any ε > 0, we show how to translate
a function f of NSA with time and work complexity
T, W into an equivalent function in SA (thus using
only flat types), with time and work complexity O(T)
and $O(W^{1+\varepsilon})$. Of course, any function in SA can be
expressed in NSA with the same time and work complexity.

• We show that SA and BVRAM are equivalent, in the
sense that any function in SA can be simulated by a
BVRAM with the same time and work complexity, and
conversely. One direction of this equivalence helps us
complete the compilation, while the other direction
allows us to perform optimizations at the level of the
language SA, instead of BVRAM.

7.1 The Sequence Algebra, SA

The Sequence Algebra, SA, only has flat types. More precisely,
we first define scalar types by the grammar: s ::=
unit | N | s × s | s + s, and next define the flat types by the
grammar: t ::= unit | [s] | t × t | t + t.

SA was designed by choosing some set of functions expressible
in NSA (or, equivalently, NSC) over flat types,
which seemed to be enough to allow the language NSA
to be translated (flattened) into SA. In addition, SA is
defined in an inductive way, which enables us to prove, by
induction, properties about the functions expressible in SA,
e.g. lemma 7.2. SA stands in the same relationship to NSA
as the relational algebra stands to the nested relational algebra
[AB88].

Similar to NSA, SA is a variable-free language, containing
some primitive functions and a set of rules for combining
them in order to get more complex functions. We briefly
describe SA below. A complete description of the language
can be found in appendix D.

• Error, viewed as a function Ω : unit → t.

• maps of scalar functions, map(φ) : [s] → [s'], where
φ : s → s' is a scalar function, i.e., informally, a function
defined in NSA (or, equivalently, NSC) having
only scalar types as input, output, and intermediate
types, and without while.

• Operations on sequences: the empty sequence [], append
@, length of a sequence, defined as length(x) =
[n], where n is the length of x, zip, bm_route, sbm_route,
selections $\sigma_1$, $\sigma_2$ (see section 3), and the emptiness test
empty?, of type [s] → B.

• Functions over flat types: the identity id : t → t,
composition of functions g ∘ f, projections $\pi_i : t_1 \times t_2 \to t_i$,
pairing of functions (f, g), injections $in_i : t_i \to t_1 + t_2$,
and sum of functions $f_1 + f_2 : t_1 + t_2 \to t$, where
$f_i : t_i \to t$ (an if construct can be derived from this).
The latter is defined by: $(f_1 + f_2)(in_1(x)) = f_1(x)$
and $(f_1 + f_2)(in_2(x)) = f_2(x)$.

• Iteration: while(p, f) is a function of type t → t, whenever
f : t → t and p : t → B (recall that B = unit + unit
and, thus, is a type of SA).

As for NSC, we define the time and work complexity
for some evaluation f(C) ⇓, where f is a function in SA and
C is its input (a flat S-object). Note that in the absence of
a general map there is no nested parallelism in SA.

Although SA does not contain nested types, like [N ×
[N × N]], it is strong enough to allow such types to be encoded
into flat types. The key technical tool for that is
to encode some nonflat type [t], where t is a flat type, by
some flat type SEQ(t). For this we use segment descriptors,
as in [Ble90]. Formally, we transform some flat type
t into another flat type SEQ(t), defined by induction on t:

(1) SEQ(unit) = [N]
(2) SEQ([s]) = [N] × [s]
(3) SEQ(t × t') = SEQ(t) × SEQ(t')
(4) SEQ(t + t') = [B] × SEQ(t) × SEQ(t')

The idea is that SEQ(t), although
a flat type, can encode sequences of elements from t, i.e.
values of type [t]. The main technical fact enabling us to
prove efficient compilation is the following map lemma.
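Case (2) of the definition of SEQ is the familiar segment-descriptor encoding. The Python sketch below illustrates it on lists; the function names are hypothetical, and only the case of a sequence of scalar sequences is covered.

```python
def encode_seq_of_seqs(xss):
    """Case (2): a value of type [[s]] becomes a pair ([N], [s])."""
    lengths = [len(xs) for xs in xss]          # the segment descriptor
    data = [x for xs in xss for x in xs]       # all elements, flattened
    return lengths, data

def decode_seq_of_seqs(lengths, data):
    """Inverse of encode_seq_of_seqs."""
    out, pos = [], 0
    for n in lengths:
        out.append(data[pos:pos + n])
        pos += n
    return out

def seq_map_scalar(phi, encoded):
    """Simulate map(map(phi)) on the encoding: one flat map over the
    data, with the segment descriptor left unchanged."""
    lengths, data = encoded
    return lengths, [phi(x) for x in data]
```

The last function shows why the encoding is useful: a nested map of a scalar function turns into a single flat map, with no nested parallelism left.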

Lemma 7.2 (The Map Lemma) Let f : t → t' be
some function in SA, and let T, W be the time and work
complexity of map(f) (recall that map(f) is in NSC, but
not in SA). Then, for every ε > 0, there exists some function
SEQ(f) : SEQ(t) → SEQ(t') in SA, of time complexity
O(T) and work complexity $O(W^{1+\varepsilon})$, which simulates
map(f) : [t] → [t']. Moreover, the structure of SEQ(f) is independent
of ε, which implies that the “number of vector registers”
used by SEQ(f) is independent of ε.

Proof. (Sketch) This is done by induction on the structure
of f. When f is map of a scalar function, SEQ(f) is
essentially the same map. When f is some operation on
a sequence, we only mention that SEQ(empty?) is essentially
a selection, SEQ($\sigma_1$) essentially $\sigma_1$, SEQ(bm_route)
is an sbm_route, while SEQ(sbm_route) is another sbm_route.
The only difficult case is when f is while(p, g). We describe
very informally how to compute SEQ(while(p, g))(x), with
$x = [x_0, \dots, x_{n-1}]$, on a BVRAM. We could use the same
idea as in theorem 4.2, but then the number of registers
would depend on ε. Suppose x is in register $V_0$. We will use
only two additional registers, $V_1$ and $V_2$, which are initially
empty. Let $t_i$ be the number of iterations of while(p, g)($x_i$),
and assume without loss of generality that $t_0 < t_1 < \dots < t_{n-1}$
(we conceptually group all $x_i$'s having the same $t_i$),
which implies $t_i \ge i$. Let $\delta = n^{\varepsilon}$, $w_i = W(while(p,g), x_i)$
and $r = 1/\varepsilon$. For the moment, assume that in the sequence
$x_i, g(x_i), g^{(2)}(x_i), \dots$, the last value (on position $t_i$) has the
smallest size, denoted by $s_i$, so $s_i t_i \le w_i$. The simulation
proceeds in r stages. The first stage starts by repeatedly applying
SEQ(g) on x: whenever some $x_i$'s reach the end of the
iteration, move them into $V_1$, until the first δ (= $n^{\varepsilon}$) values
are extracted from $V_0$, namely $x_i$, $i = 1, \dots, \delta$. The additional
work complexity due to repeatedly touching the values in
$V_1$ is $O(n^{\varepsilon} W)$. At this point, we move the entire $V_1$ into
$V_2$. For each of the remaining stages $k = 1, \dots, r - 1$, apply
repeatedly SEQ(g) on x, and move, when they terminate,
the elements $x_i$, $i = \delta^k + 1, \dots, \delta^{k+1}$, from $V_0$ to $V_1$: at the end
of stage k, we move the entire $V_1$ into $V_2$. The additional
work complexity due to repeatedly touching some element
$x_i$ in $V_1$ at this stage is $\le s_i \delta^{k+1}$. But since $i \ge \delta^k + 1 > \delta^k$,
we have that $t_i \ge i > \delta^k$, hence the additional work
complexity for $x_i$ is $\le s_i t_i \delta \le w_i n^{\varepsilon}$, which, when added
up, accounts for only $O(n^{\varepsilon} W)$ for stage k, which adds up
to at most $O(r\, n^{\varepsilon} W) = O(n^{\varepsilon} W)$ for all r stages. During
all r stages, $V_2$ is touched only r times, for an additional
O(W) work complexity. At the end of the last stage, all
$x_i$'s ($i = 1, \dots, n$) end up in $V_2$, so $V_2$ contains the result of
SEQ(while(p, g))(x).

Finally we have to show how to define SEQ(while(p, g))(x)
in the general case, when the sequence $x_i, g(x_i), g^{(2)}(x_i), \dots, g^{(t_i)}(x_i)$
has a minimum size at some position $m_i$ which is not
necessarily the last one. In that case we first compute $m_i$
for each i: this can be done with complexities O(T) and
O(W), by simply applying SEQ(g) repeatedly, and eliminating
those elements which reach the end of their iteration.
Next we split the whole iteration SEQ(while(p, g))(x)
in two parts, essentially by synchronizing the n parallel iterations
at the moment when they reach their minimum size,
namely: (1) perform the n parallel iterations, as described
above, but stop the iteration over $x_i$ at step $m_i$; (2) continue
the n parallel iterations, from step $m_i$ to $t_i$, using the
same technique, but in reverse (because now the minimum
sizes are at the beginning).

□

Remark 7.3 Had we had arbitrarily many registers instead
of a bounded number, we could have designed SEQ(f) with
time and work complexity O(T) and O(W) (instead of O(T)
and $O(W^{1+\varepsilon})$), which is used in the proof of proposition 3.2.
Indeed, for f = while(p, g), assume again that, $\forall i = 1, \dots, n$, the
smallest size, denoted $s_i$, in the sequence $x_i, g(x_i), g^{(2)}(x_i), \dots, g^{(t_i)}(x_i)$
is at the last position. Then SEQ(while(p, g))
is simulated by placing, upon completion, each element $x_i$
in some different register $V_i$. At the end we have to combine
the registers $V_1, \dots, V_n$, which we do in the following
order: combine $V_n$ with $V_{n-1}$, the result with $V_{n-2}$, ..., the
result with $V_1$. The additional work complexity for the combine
phase due to $x_i$ is $s_i i$, which is bounded by $w_i$ because of
our assumption about $s_i$. We can extend the simulation to
the case when the smallest sizes $s_i$ are reached at arbitrary
moments, using the same technique as above.

Finally, we flatten the language NSA into SA. We start
by flattening the types. For every type s of NSA, we define
COMPILE(s) to be a flat type, which encodes s. Namely:


COMPILE(unit)   = unit
COMPILE(N)      = [N]
COMPILE(s × s') = COMPILE(s) × COMPILE(s')
COMPILE(s + s') = COMPILE(s) + COMPILE(s')
COMPILE([s])    = SEQ(COMPILE(s))
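The type translation can be written as two structurally recursive functions. In the Python sketch below, types are represented as nested tuples; this representation, the constructor names, and the nesting of the three-component product in case (4) of SEQ as a pair of pairs are all assumptions made for the illustration.

```python
UNIT, NAT = ("unit",), ("N",)

def prod(a, b): return ("prod", a, b)
def summ(a, b): return ("sum", a, b)
def seq(a): return ("seq", a)

BOOL = summ(UNIT, UNIT)  # B = unit + unit

def SEQ(t):
    """SEQ on flat types, following cases (1)-(4) of section 7.1."""
    tag = t[0]
    if tag == "unit":
        return seq(NAT)
    if tag == "seq":
        return prod(seq(NAT), t)
    if tag == "prod":
        return prod(SEQ(t[1]), SEQ(t[2]))
    if tag == "sum":
        return prod(seq(BOOL), prod(SEQ(t[1]), SEQ(t[2])))
    raise ValueError("not a flat type: %r" % (t,))

def COMPILE(s):
    """Flat type encoding an arbitrary (possibly nested) type."""
    tag = s[0]
    if tag == "unit":
        return UNIT
    if tag == "N":
        return seq(NAT)
    if tag == "prod":
        return prod(COMPILE(s[1]), COMPILE(s[2]))
    if tag == "sum":
        return summ(COMPILE(s[1]), COMPILE(s[2]))
    if tag == "seq":
        return SEQ(COMPILE(s[1]))
    raise ValueError("unknown type: %r" % (s,))

def is_scalar(s):
    return s[0] in ("unit", "N") or (
        s[0] in ("prod", "sum") and is_scalar(s[1]) and is_scalar(s[2]))

def is_flat(t):
    if t[0] == "unit":
        return True
    if t[0] == "seq":
        return is_scalar(t[1])
    return t[0] in ("prod", "sum") and is_flat(t[1]) and is_flat(t[2])
```

For example, under this sketch `COMPILE(seq(NAT))` is `prod(seq(NAT), seq(NAT))`, i.e. [N] × [N]: a segment descriptor paired with the flattened data.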


Also, we define the functions $encode_s$ : s → COMPILE(s)
and $decode_s$ : COMPILE(s) → s in NSA, with time complexity
O(1) and work complexity linear in the size of the
input, with the property $decode_s(encode_s(x)) = x$, for every
x ∈ s. The definitions of the functions encode and decode
are rather standard, and are omitted from this extended
abstract.

Finally,  we  can  prove: 


Proposition 7.4 Let f : s → s' be some function in NSA
with time and work complexity T, W. Then, for every ε >
0, there is some function COMPILE(f) : COMPILE(s) →
COMPILE(s') in SA which “simulates f”, i.e. for every
x, COMPILE(f)(encode(x)) = encode(f(x)), with time and
work complexity O(T), $O(W^{1+\varepsilon})$. Moreover, COMPILE(f)
requires the same “number of BVRAM registers” for every ε.


Proof. (Sketch) By induction on the structure of f.
All cases are straightforward, except for the case when f =
map(g), where we use the Map Lemma.


7.2 Equivalence of SA and BVRAM

The types in SA are slightly richer than those of the BVRAM:
SA allows for types like [unit + N + N × N] + [N × N] ×
[N] + unit, while the types on the BVRAM are only of the
form [N] × ... × [N]. However, encoding of SA types into
BVRAM types is straightforward.

Proposition 7.5 SA and BVRAM are equivalent, i.e. any
function f in SA with time and work complexity T, W can
be simulated on a BVRAM with the same time and work
complexity, and conversely.

Proof. Simulating some function of SA by a BVRAM
program is easily done by induction on the structure of that
function. The converse is slightly more involved. Indeed, let
r be the number of registers of a BVRAM M, and h some
function in SA of type $[N] \times ([N])^r \to [N] \times ([N])^r$ performing
one step of the program of M (where the program counter is
encoded by a singleton sequence, in the first position). By
iterating h we indeed achieve the desired time complexity,
but not the work complexity, since at each step the function
h touches all r registers. To avoid this, we define a sequence
of r functions $f_i$, $i = 1, \dots, r$. The inputs and outputs for $f_i$ are
the values of the i “smallest” registers at some particular
moment, the indexes of these i registers, the size S of the
next largest register, and the program counter. $f_i$ iterates
the one-step function as long as it only affects the i registers
it sees, and as long as all the i sizes stay less than S. If any
of these conditions is violated, $f_i$ stops. To do its job, $f_i$
calls $f_{i-1}$, which iterates steps of M by only looking at the
smallest i - 1 registers: when $f_{i-1}$ finishes, $f_i$ tries to do
one more step by taking into account the i-th smallest register
as well, which $f_{i-1}$ ignores. If it cannot, then it returns (to
$f_{i+1}$). Else, it performs the operation, and calls $f_{i-1}$ again,
possibly with a different set of i - 1 registers, from the set
of i registers it sees. □

Although only one direction of proposition 7.5 is actually
needed for the compilation theorem 7.1, the converse is significant
from the point of view of optimizations: it implies
that any optimizations done for the BVRAM can also be
performed at the level of the SA language.

8 Conclusions

We intend to use NSC as a core for a “real” parallel language
for querying nested collections, by adding proven features
such as those encountered in functional languages like ML.
Guaranteed complexity bounds such as those emerging from
this paper can serve as useful guidelines for language design,
especially in the database area. Of course, the techniques
we have used in the translation of map-recursion and in the
unnesting of nested parallelism need to be validated by practical
implementations. Equally important is to continue to
investigate the practical expressiveness of NSC by attempting
to represent various known efficient parallel algorithms.
Another direction of investigation is to develop optimization
techniques for this language by using ideas that have
proved useful in databases.
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A  The  Nested  Sequence  Calculus  J^SC 

We  define  a  type  context  F  to  be  a  set  of  the  form  F  =  {2:1  :  :  «„},  where  Xi  are  variables  and  a;  are  types.  We 

write  F  t>  M  :  t,  or  F  t>  F  :  s  t,  when  we  want  to  say  that,  under  the  type  assumptions  of  F,  the  term  M  has  type  t,  or 

the  function  F  has  type  s  ^  t.  Below  are  the  rules  defining  the  language.  Recall  that  B  =  unit  +  unit. 


Variables,  Errors,  Constants,  Arithmetic 


x:t,r\>x:t  F  l>fi'  :  <  F  1>  n  :  N  ^ 

Fl>Af:N  Ft>Ar:N  ,  Ft>M:N  Ft>A:N 

r  t>  M  op  A  :  N  T  t>M  =  N  :  bool 

Type  products 

Ft>M:s,  Fl>iV:f  F  t>  M  :  s  X  t  F  l>  M  :  a  X  f 
F  l>  0  :  unit  T  t>  {M,N)  :  s  X  t  T  t>  ni{M)  :  s  T  t>  Tr2{M)  :  t 

Type  sums 


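$$\frac{\Gamma \triangleright M : s}{\Gamma \triangleright \mathit{in}_1(M) : s + t} \qquad \frac{\Gamma \triangleright M : t}{\Gamma \triangleright \mathit{in}_2(M) : s + t}$$

$$\frac{\Gamma \triangleright M : s + t \quad x : s,\ \Gamma \triangleright N : u \quad y : t,\ \Gamma \triangleright P : u}{\Gamma \triangleright \mathit{case}\ M\ \mathit{of}\ \mathit{in}_1(x) \Rightarrow N \mid \mathit{in}_2(y) \Rightarrow P : u}$$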


Functions 


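$$\frac{x : s,\ \Gamma \triangleright M : t}{\Gamma \triangleright \lambda x : s.\,M : s \to t} \qquad \frac{\Gamma \triangleright F : s \to t \quad \Gamma \triangleright M : s}{\Gamma \triangleright F(M) : t}$$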


Iteration 


$$\frac{\Gamma \triangleright P : t \to \mathit{bool} \quad \Gamma \triangleright F : t \to t}{\Gamma \triangleright \mathit{while}(P, F) : t \to t}$$


Collections 


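$$\Gamma \triangleright [\,] : [t] \qquad \frac{\Gamma \triangleright M : t}{\Gamma \triangleright [M] : [t]} \qquad \frac{\Gamma \triangleright M : [t] \quad \Gamma \triangleright N : [t]}{\Gamma \triangleright M @ N : [t]} \qquad \frac{\Gamma \triangleright M : [[t]]}{\Gamma \triangleright \mathit{flatten}(M) : [t]}$$

$$\frac{\Gamma \triangleright M : [t]}{\Gamma \triangleright \mathit{length}(M) : \mathbb{N}} \qquad \frac{\Gamma \triangleright M : [t]}{\Gamma \triangleright \mathit{get}(M) : t} \qquad \frac{\Gamma \triangleright F : s \to t}{\Gamma \triangleright \mathit{map}(F) : [s] \to [t]}$$


Sequences


$$\frac{\Gamma \triangleright M : [s] \quad \Gamma \triangleright N : [t]}{\Gamma \triangleright \mathit{zip}(M, N) : [s \times t]} \qquad \frac{\Gamma \triangleright M : [t]}{\Gamma \triangleright \mathit{enumerate}(M) : [\mathbb{N}]} \qquad \frac{\Gamma \triangleright M : [t] \quad \Gamma \triangleright N : [\mathbb{N}]}{\Gamma \triangleright \mathit{split}(M, N) : [[t]]}$$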


Weakening 


$$\frac{\Gamma \triangleright M : t}{x : s,\ \Gamma \triangleright M : t}$$
B  Operational Semantics

We define an environment to be a finite set of the form $\rho = \{x_1 = C_1, \ldots, x_n = C_n\}$, where $x_1, \ldots, x_n$ are variables, and $C_1, \ldots, C_n$ are S-objects. We say that $\rho$ is associated to some type context $\Gamma$ iff $\rho$ and $\Gamma$ mention exactly the same variables and if the type of $C_i$ is the type of the variable $x_i$ in $\Gamma$.

The following rules define the ternary relation $\rho \bullet M \Downarrow C$ and the 4-ary relation $\rho \bullet F(C) \Downarrow C'$, where $\rho$ is associated to some type context $\Gamma$ such that $\Gamma \triangleright M : t$, or $\Gamma \triangleright F : s \to t$ respectively.


Variables,  Errors,  Constants,  Arithmetic 

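$$x = C,\ \rho \bullet x \Downarrow C \qquad \rho \bullet n \Downarrow n$$

$$\frac{\rho \bullet M \Downarrow m \quad \rho \bullet N \Downarrow n}{\rho \bullet M + N \Downarrow m + n} \qquad \text{(similar for all } \mathit{op} \in \Sigma \text{ and } =\text{)}$$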

Type  products 

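$$\rho \bullet () \Downarrow () \qquad \frac{\rho \bullet M \Downarrow C \quad \rho \bullet N \Downarrow D}{\rho \bullet (M, N) \Downarrow (C, D)} \qquad \frac{\rho \bullet M \Downarrow (C, D)}{\rho \bullet \pi_1(M) \Downarrow C} \qquad \frac{\rho \bullet M \Downarrow (C, D)}{\rho \bullet \pi_2(M) \Downarrow D}$$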

Type  sums 

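$$\frac{\rho \bullet M \Downarrow C}{\rho \bullet \mathit{in}_1(M) \Downarrow \mathit{in}_1(C)} \qquad \frac{\rho \bullet M \Downarrow C}{\rho \bullet \mathit{in}_2(M) \Downarrow \mathit{in}_2(C)}$$

$$\frac{\rho \bullet M \Downarrow \mathit{in}_1(C) \quad x = C,\ \rho \bullet N \Downarrow D}{\rho \bullet \mathit{case}\ M\ \mathit{of}\ \mathit{in}_1(x) \Rightarrow N \mid \mathit{in}_2(y) \Rightarrow P \Downarrow D}$$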

Functions 

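$$\frac{\rho \bullet M \Downarrow C \quad \rho \bullet F(C) \Downarrow D}{\rho \bullet F(M) \Downarrow D} \qquad \frac{x = C,\ \rho \bullet M \Downarrow D}{\rho \bullet (\lambda x.M)(C) \Downarrow D}$$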

Iteration 

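$$\frac{\rho \bullet P(C) \Downarrow \mathit{false}}{\rho \bullet \mathit{while}(P, F)(C) \Downarrow C} \qquad \frac{\rho \bullet P(C) \Downarrow \mathit{true} \quad \rho \bullet F(C) \Downarrow C' \quad \rho \bullet \mathit{while}(P, F)(C') \Downarrow D}{\rho \bullet \mathit{while}(P, F)(C) \Downarrow D}$$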
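The two iteration rules can be read as an ordinary loop: the first rule stops when the predicate evaluates to false, the second unfolds one step and continues from C'. The following is a minimal Python sketch of that reading, not part of the calculus itself; the names `while_` and `double_until_100` are hypothetical, and Python callables stand in for closed functions, so no environment is modelled.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def while_(p: Callable[[T], bool], f: Callable[[T], T]) -> Callable[[T], T]:
    """Build the function while(p, f) : t -> t (illustrative stand-in)."""

    def run(c: T) -> T:
        # Second rule: while p(c) is true, continue from c' = f(c).
        while p(c):
            c = f(c)
        # First rule: p(c) is false, so the result is c itself.
        return c

    return run


# Hypothetical usage: keep doubling until the value is at least 100.
double_until_100 = while_(lambda n: n < 100, lambda n: 2 * n)
```

If the predicate never becomes false, the loop does not terminate, which corresponds to no derivation existing for the evaluation relation.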

Collections 

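$$\rho \bullet [\,] \Downarrow [\,] \qquad \frac{\rho \bullet M \Downarrow C}{\rho \bullet [M] \Downarrow [C]} \qquad \frac{\rho \bullet M \Downarrow [C_0, \ldots, C_{m-1}] \quad \rho \bullet N \Downarrow [D_0, \ldots, D_{n-1}]}{\rho \bullet M @ N \Downarrow [C_0, \ldots, C_{m-1}, D_0, \ldots, D_{n-1}]}$$

$$\frac{\rho \bullet M \Downarrow [[C_{00}, C_{01}, \ldots], [C_{10}, C_{11}, \ldots], \ldots]}{\rho \bullet \mathit{flatten}(M) \Downarrow [C_{00}, C_{01}, \ldots, C_{10}, C_{11}, \ldots]} \qquad \frac{\rho \bullet M \Downarrow [C_0, \ldots, C_{n-1}]}{\rho \bullet \mathit{length}(M) \Downarrow n}$$

$$\frac{\rho \bullet M \Downarrow [C]}{\rho \bullet \mathit{get}(M) \Downarrow C} \qquad \frac{\rho \bullet F(C_0) \Downarrow D_0 \quad \cdots \quad \rho \bullet F(C_{n-1}) \Downarrow D_{n-1}}{\rho \bullet \mathit{map}(F)([C_0, \ldots, C_{n-1}]) \Downarrow [D_0, \ldots, D_{n-1}]}$$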

Sequences 

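$$\frac{\rho \bullet M \Downarrow [C_0, \ldots, C_{n-1}] \quad \rho \bullet N \Downarrow [D_0, \ldots, D_{n-1}]}{\rho \bullet \mathit{zip}(M, N) \Downarrow [(C_0, D_0), \ldots, (C_{n-1}, D_{n-1})]} \qquad \frac{\rho \bullet M \Downarrow [C_0, \ldots, C_{n-1}]}{\rho \bullet \mathit{enumerate}(M) \Downarrow [0, \ldots, n-1]}$$

$$\frac{\rho \bullet M \Downarrow [C_0, \ldots, C_{n_0 + \cdots + n_{m-1} - 1}] \quad \rho \bullet N \Downarrow [n_0, \ldots, n_{m-1}]}{\rho \bullet \mathit{split}(M, N) \Downarrow [[C_0, \ldots, C_{n_0 - 1}], [C_{n_0}, \ldots, C_{n_0 + n_1 - 1}], \ldots, [C_{n_0 + \cdots + n_{m-2}}, \ldots, C_{n_0 + \cdots + n_{m-1} - 1}]]}$$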
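To make the sequence rules concrete, here is a small Python sketch of zip, enumerate and split, together with flatten from the Collections rules, acting on Python lists. It is an illustration under stated assumptions, not the paper's implementation: the function names carry a trailing underscore where they would clash with Python built-ins, and raising `ValueError` is our own way of modelling the cases in which no rule applies.

```python
from typing import List, Sequence, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def zip_(xs: Sequence[A], ys: Sequence[B]) -> List[Tuple[A, B]]:
    # The zip rule only covers two sequences of the same length n.
    if len(xs) != len(ys):
        raise ValueError("zip: lengths differ, no rule applies")
    return list(zip(xs, ys))


def enumerate_(xs: Sequence[A]) -> List[int]:
    # [C_0, ..., C_{n-1}] evaluates to [0, ..., n-1].
    return list(range(len(xs)))


def split(xs: Sequence[A], ns: Sequence[int]) -> List[List[A]]:
    # The split rule requires len(xs) = n_0 + ... + n_{m-1}.
    if any(n < 0 for n in ns) or sum(ns) != len(xs):
        raise ValueError("split: segment lengths do not match the sequence")
    out: List[List[A]] = []
    start = 0
    for n in ns:
        out.append(list(xs[start:start + n]))
        start += n
    return out


def flatten(xss: Sequence[Sequence[A]]) -> List[A]:
    # Concatenate the inner sequences in order.
    return [x for xs in xss for x in xs]
```

In this sketch flatten undoes split: `flatten(split(xs, ns))` returns `xs` whenever `split` is defined.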

Weakening 

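$$\frac{\rho \bullet M \Downarrow C}{x = C',\ \rho \bullet M \Downarrow C} \qquad \frac{\rho \bullet F(C) \Downarrow D}{x = C',\ \rho \bullet F(C) \Downarrow D}$$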



C  The Nested Sequence Algebra NSA


Errors,  Constants,  Arithmetic 

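$$\Omega^t : \mathit{unit} \to t \qquad \frac{n \in \mathbb{N}}{n : \mathit{unit} \to \mathbb{N}} \qquad \frac{\mathit{op} \in \Sigma}{\mathit{op} : \mathbb{N} \times \mathbb{N} \to \mathbb{N}} \qquad {=} : \mathbb{N} \times \mathbb{N} \to \mathbb{B}$$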

Function  identity  and  composition 

$$\mathit{id}_t : t \to t \qquad \frac{f : r \to s \quad g : s \to t}{g \circ f : r \to t}$$

Type  products 

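$$!_t : t \to \mathit{unit} \qquad \frac{f_1 : s \to t_1 \quad f_2 : s \to t_2}{\langle f_1, f_2 \rangle : s \to t_1 \times t_2} \qquad \pi_1 : t_1 \times t_2 \to t_1 \qquad \pi_2 : t_1 \times t_2 \to t_2$$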

Type  sums 

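$$\mathit{in}_1 : t_1 \to t_1 + t_2 \qquad \mathit{in}_2 : t_2 \to t_1 + t_2 \qquad \frac{f_1 : s_1 \to t \quad f_2 : s_2 \to t}{f_1 + f_2 : s_1 + s_2 \to t} \qquad \delta : (t_1 + t_2) \times t \to t_1 \times t + t_2 \times t$$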

Iteration 

$$\frac{p : t \to \mathbb{B} \quad f : t \to t}{\mathit{while}(p, f) : t \to t}$$

Collections 

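$$[\,] : \mathit{unit} \to [t] \qquad \mathit{singleton} : t \to [t] \qquad @ : [t] \times [t] \to [t] \qquad \mathit{flatten} : [[t]] \to [t]$$

$$\mathit{length} : [t] \to \mathbb{N} \qquad \mathit{get} : [t] \to t \qquad \frac{f : s \to t}{\mathit{map}(f) : [s] \to [t]}$$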

Sequences 


$$\mathit{zip} : [s] \times [t] \to [s \times t] \qquad \mathit{enumerate} : [t] \to [\mathbb{N}] \qquad \mathit{split} : [t] \times [\mathbb{N}] \to [[t]]$$

Broadcast  This replaces the “free variables” present in NSC.


$$\rho_2 : s \times [t] \to [s \times t]$$
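As an informal illustration of how broadcast can stand in for a free variable, the Python sketch below compares a map whose body mentions an outer variable with a closed formulation that first pairs that value with every element and then maps a closed function. The names `rho2`, `plus`, `add_to_all_nsc` and `add_to_all_nsa` are hypothetical, and the reading of the broadcast operation as "pair the first component with each element of the sequence" is an assumption based on its type.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")
T = TypeVar("T")


def rho2(pair: Tuple[S, Sequence[T]]) -> List[Tuple[S, T]]:
    # Assumed reading of broadcast: s x [t] -> [s x t].
    s, ts = pair
    return [(s, t) for t in ts]


def add_to_all_nsc(x: int, ys: Sequence[int]) -> List[int]:
    # Calculus style: the mapped function mentions the free variable x.
    return list(map(lambda y: x + y, ys))


def plus(p: Tuple[int, int]) -> int:
    # A closed function on pairs; it mentions no outer variables.
    return p[0] + p[1]


def add_to_all_nsa(pair: Tuple[int, Sequence[int]]) -> List[int]:
    # Algebra style: map(plus) composed with rho2, with no free variables.
    return list(map(plus, rho2(pair)))
```

Both formulations return the same list; the second passes the shared value explicitly through the data rather than through an environment.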


The evaluation relation $f(C) \Downarrow C'$, for $f$ some function in NSA of type $s \to t$ and $C$, $C'$ S-objects of type $s$ and $t$ respectively, is defined in a way similar to the definition for NSC, but simpler because functions in NSA do not have free variables, hence there is no need for an environment. The time and work complexity $T(f, C)$ and $W(f, C)$ are defined accordingly.

Proposition C.1  Any closed function $f \in$ NSC with time and work complexity $T$, $W$ is expressible in NSA by some function $f'$ with time and work complexity $O(T)$, $O(W)$, and vice versa. Thus, NSC and NSA have the same expressive power.


D  The  Sequence  Algebra  SA 

Scalar types are: $s ::= \mathit{unit} \mid \mathbb{N} \mid s \times s \mid s + s$. Scalar functions $\varphi : s \to s'$ are given by:


Constants,  Arithmetic 

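$$\frac{n \in \mathbb{N}}{n : \mathit{unit} \to \mathbb{N}} \qquad \frac{\mathit{op} \in \Sigma}{\mathit{op} : \mathbb{N} \times \mathbb{N} \to \mathbb{N}} \qquad {=} : \mathbb{N} \times \mathbb{N} \to \mathbb{B}$$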

Function  identity  and  composition 

$$\mathit{id}_s : s \to s \qquad \frac{\varphi : s \to s' \quad \psi : s' \to s''}{\psi \circ \varphi : s \to s''}$$

Scalar  type  products 

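$$!_s : s \to \mathit{unit} \qquad \pi_1 : s_1 \times s_2 \to s_1 \qquad \pi_2 : s_1 \times s_2 \to s_2 \qquad \frac{\varphi_1 : s \to s_1 \quad \varphi_2 : s \to s_2}{\langle \varphi_1, \varphi_2 \rangle : s \to s_1 \times s_2}$$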

Scalar  type  sums 

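$$\mathit{in}_1 : s_1 \to s_1 + s_2 \qquad \mathit{in}_2 : s_2 \to s_1 + s_2 \qquad \frac{\varphi_1 : s_1 \to s \quad \varphi_2 : s_2 \to s}{\varphi_1 + \varphi_2 : s_1 + s_2 \to s} \qquad \delta : (s_1 + s_2) \times s \to s_1 \times s + s_2 \times s$$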

Flat types are: $t ::= \mathit{unit} \mid [s] \mid t \times t \mid t + t$. Functions in SA $f : t \to t'$ are given by:


Errors  and  Scalar  operations 

$$\Omega^t : \mathit{unit} \to t \qquad \frac{\varphi : s \to s' \ \text{a scalar function}}{\mathit{map}(\varphi) : [s] \to [s']}$$

Function  identity  and  composition 

$$\mathit{id}_t : t \to t \qquad \frac{f : t \to t' \quad g : t' \to t''}{g \circ f : t \to t''}$$

Flat  type  products 

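$$!_t : t \to \mathit{unit} \qquad \pi_1 : t_1 \times t_2 \to t_1 \qquad \pi_2 : t_1 \times t_2 \to t_2 \qquad \frac{f_1 : t \to t_1 \quad f_2 : t \to t_2}{\langle f_1, f_2 \rangle : t \to t_1 \times t_2}$$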

Flat  type  sums 

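$$\mathit{in}_1 : t_1 \to t_1 + t_2 \qquad \mathit{in}_2 : t_2 \to t_1 + t_2 \qquad \frac{f_1 : t_1 \to t \quad f_2 : t_2 \to t}{f_1 + f_2 : t_1 + t_2 \to t} \qquad \delta : (t_1 + t_2) \times t \to t_1 \times t + t_2 \times t$$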

Iterations 

$$\frac{p : t \to \mathbb{B} \quad f : t \to t}{\mathit{while}(p, f) : t \to t}$$

Collections 


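$$[\,] : \mathit{unit} \to [s] \qquad \mathit{singleton} : \mathit{unit} \to [\mathit{unit}] \qquad @ : [s] \times [s] \to [s] \qquad \mathit{length} : [s] \to [\mathbb{N}]$$

$$\mathit{empty?} : [s] \to \mathbb{B} \qquad \sigma_1 : [s_1 + s_2] \to [s_1] \qquad \sigma_2 : [s_1 + s_2] \to [s_2]$$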

Sequences 


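$$\mathit{zip} : [s] \times [s'] \to [s \times s'] \qquad \mathit{enumerate} : [s] \to [\mathbb{N}] \qquad \textit{bm-route} : ([s] \times [\mathbb{N}]) \times [s'] \to [s']$$

$$\textit{sbm-route} : ([s] \times [\mathbb{N}]) \times ([s'] \times [\mathbb{N}]) \to [s']$$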


Example D.1  Informally we show how to compute $\mathit{combine} : [\mathbb{B}] \times [s] \times [s] \to [s]$, where $\mathit{combine}(f, x, y)$ combines the lists $x$ and $y$, according to the flags given by $f$. The resulting list will have the same length as $f$, and will contain some $x_i$ on those positions where $f$ is true, and some $y_j$ where $f$ is false. E.g. when $f = [\mathit{true}, \mathit{false}, \mathit{false}, \mathit{true}, \mathit{false}, \mathit{true}, \mathit{true}]$ and $x = [x_0, x_1, x_2, x_3]$, $y = [y_0, y_1, y_2]$, then $\mathit{combine}(f, x, y)$ must be $[x_0, y_0, y_1, x_1, y_2, x_2, x_3]$. To compute combine in SA, start by enumerate-ing $f$, to get $[0, 1, 2, 3, 4, 5, 6]$, and by transforming the booleans into 0 and 1, to get $[1, 0, 0, 1, 0, 1, 1]$. Now apply bm-route to select from the first list those elements having a 1 in the second list, and obtain $[0, 3, 5, 6]$. Similarly, we obtain $[1, 2, 4]$. These two lists tell us on which position each element of $x$ and $y$ must end up. Next, we subtract each number in this list from its right neighbor (by considering $7 = \mathit{length}(f)$ to be the right neighbor of the last element), with the exception of the first position, where we also add the number itself. I.e., we get: $[0 + (3 - 0), 5 - 3, 6 - 5, 7 - 6] = [3, 2, 1, 1]$ and $[1 + (2 - 1), 4 - 2, 7 - 4] = [2, 2, 3]$. Now we bm-route $x$ and $y$, using these two lists as replication sequences, and get $[x_0, x_0, x_0, x_1, x_1, x_2, x_3]$ and $[y_0, y_0, y_1, y_1, y_2, y_2, y_2]$ respectively (both have the length of $f$). Finally, we zip them together with $f$, and map some scalar function which selects $x_i$ or $y_i$ according to the flag.
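The steps of Example D.1 can be traced with a short Python sketch. It follows the example's arithmetic, but it is only an illustration under assumptions: `positions` and `replicate` are hypothetical helpers standing in for the two uses of bm-route (selection and replication), whose precise semantics is not reproduced here, and the `None` padding for the case where one of the lists is empty is our own addition.

```python
from typing import List, Optional, Sequence, TypeVar

A = TypeVar("A")


def positions(flags: Sequence[bool], wanted: bool) -> List[int]:
    # Stand-in for the selection step: indices whose flag equals `wanted`.
    return [i for i, b in enumerate(flags) if b == wanted]


def replication_counts(pos: Sequence[int], n: int) -> List[int]:
    # Subtract each position from its right neighbour (n for the last one),
    # then add the first position itself to the first count.
    if not pos:
        return []
    pos = list(pos)
    counts = [nxt - cur for cur, nxt in zip(pos, pos[1:] + [n])]
    counts[0] += pos[0]
    return counts


def replicate(xs: Sequence[A], counts: Sequence[int]) -> List[A]:
    # Stand-in for the replication step: repeat xs[i] counts[i] times.
    return [x for x, k in zip(xs, counts) for _ in range(k)]


def spread(xs: Sequence[A], pos: Sequence[int], n: int) -> List[Optional[A]]:
    # Assumption: if no position wants a value, pad with None placeholders.
    if not pos:
        return [None] * n
    return replicate(xs, replication_counts(pos, n))


def combine(f: Sequence[bool], x: Sequence[A], y: Sequence[A]) -> list:
    n = len(f)
    true_pos, false_pos = positions(f, True), positions(f, False)
    if len(x) != len(true_pos) or len(y) != len(false_pos):
        raise ValueError("combine: list lengths do not match the flags")
    xs, ys = spread(x, true_pos, n), spread(y, false_pos, n)
    # Zip with the flags and select x or y at each position.
    return [xv if flag else yv for flag, xv, yv in zip(f, xs, ys)]
```

On the flags and lists of the example, the intermediate replication sequences come out as [3, 2, 1, 1] and [2, 2, 3], matching the text.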